Abstract



We study the regret of optimal strategies for online convex optimization games. Using von Neumann's 
minimax theorem, we show that the optimal regret in this adversarial setting is closely related to the 
behavior of the empirical minimization algorithm in a stochastic process setting: it is equal to the 
maximum, over joint distributions of the adversary's action sequence, of the difference between a sum 
of minimal expected losses and the minimal empirical loss. We show that the optimal regret has a 
natural geometric interpretation, since it can be viewed as the gap in Jensen's inequality for a concave 
functional — the minimizer over the player's actions of expected loss — defined on a set of probability 
distributions. We use this expression to obtain upper and lower bounds on the regret of an optimal 
strategy for a variety of online learning problems. Our method provides upper bounds without the need 
to construct a learning algorithm; the lower bounds provide explicit optimal strategies for the adversary. 

1 Introduction 

Within the Theory of Learning, two particular topics have gained significant popularity over the past 20 years: 
Statistical Learning and Online Adversarial Learning. Papers on the former typically study generalization 
bounds, convergence rates, complexity measures of function classes — all under the assumption that the 
examples are drawn, typically in an i.i.d. manner, from some underlying distribution. Working under 
such an assumption, Statistical Learning finds its roots in statistics, probability theory, and high-dimensional
geometry, and one can argue that the main questions are by now relatively well-understood. 

Online Learning, while having its origins in the early 90's, recently became a popular area of research 
once again. One might argue that it is the assumptions, or lack thereof, that make online learning attractive. 
Indeed, it is often assumed that the observed data is generated maliciously rather than being drawn from 
some fixed distribution. Moreover, in contrast with the "batch learning" flavor of Statistical Learning, the
sequential nature of the online problem lets the adversary change its strategy in the middle of the interaction. 
It is no surprise that this adversarial learning seems quite a bit more difficult than its statistical cousin. The 
worst case adversarial analysis does provide a realistic modeling in learning scenarios such as network security 
applications, email spam detection, network routing etc., which is largely responsible for the renewed interest 
in this area. 

Upon a review of the central results in adversarial online learning (most of which can be found in
the recent book of Cesa-Bianchi and Lugosi [5]), one cannot help but notice frequent similarities between the
guarantees on performance of online algorithms and the analogous guarantees under stochastic assumptions. 



However, discerning an explicit link has remained elusive. Vovk [17] notices this phenomenon: "for some
important problems, the adversarial bounds of on-line competitive learning theory are only a tiny amount 
worse than the average-case bounds for some stochastic strategies of Nature." 

In this paper, we attempt to build a bridge between adversarial online learning and statistical learning. 
Using von Neumann's minimax theorem, we show that the optimal regret of an algorithm for online convex 
optimization is exactly the difference between a sum of minimal expected losses and the minimal empirical 
loss, under an adversarial choice of a stochastic process generating the data. This leads to upper and lower 
bounds for the optimal regret that exhibit several similarities to results from statistical learning. 

The online convex optimization game proceeds in rounds. At each of these T rounds, the player (learner) 
predicts a vector in some convex set, and the adversary responds with a convex function which determines 
the player's loss at the chosen point. In order to emphasize the relationship with the stochastic setting, we 
denote the player's choice as $f \in \mathcal{F}$ and the adversary's choice as $z \in \mathcal{Z}$. Note that this differs, for instance,
from the notation in [1]. 

Suppose $\mathcal{F}$ is a convex compact class of functions, which constitutes the set of the Player's choices. The
Adversary draws his choices from a closed compact set $\mathcal{Z}$. We also define a continuous bounded loss function
$\ell : \mathcal{Z} \times \mathcal{F} \to \mathbb{R}$ and assume that $\ell$ is convex in the second argument. Denote by
$\ell(\mathcal{F}) = \{\ell(\cdot, f) : f \in \mathcal{F}\}$ the associated loss class. Let $\mathcal{P}$ be the set of all probability
distributions on $\mathcal{Z}$. Denote a sequence $(Z_1, \ldots, Z_T)$ by $Z_1^T$. We denote a joint distribution on $\mathcal{Z}^T$ by a
bold-face $\mathbf{p}$ and its conditional and marginal distributions by $p_t(\cdot|Z_1^{t-1})$ and $p_t^m$, respectively.

The online convex optimization interaction is described as follows. 

Online Convex Optimization (OCO) Game 

At each time step $t = 1$ to $T$,

• Player chooses $f_t \in \mathcal{F}$

• Adversary chooses $z_t \in \mathcal{Z}$

• Player observes $z_t$ and suffers loss $\ell(z_t, f_t)$

The objective of the player is to minimize the regret

$$\sum_{t=1}^{T} \ell(z_t, f_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(z_t, f).$$

It turns out that many online learning scenarios can be realized as instances of OCO, including prediction 
with expert advice, data compression, sequential investment, and forecasting with side information (see, for 
example, [5]).
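The protocol and the regret can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the paper: the set $\mathcal{F}$ is replaced by a finite grid, the loss is the absolute loss on $[0, 1]$, and the names `play_oco`, `constant_player`, `alternating_adversary` and `stubborn_adversary` are hypothetical.

```python
import numpy as np

def play_oco(player, adversary, loss, candidates, T):
    """Run T rounds of the OCO protocol and return the player's regret.

    player(history)         -> f_t, chosen before z_t is revealed
    adversary(history, f_t) -> z_t, chosen after seeing f_t
    candidates              -> finite grid standing in for the set F (an assumption)
    """
    history = []                                  # (f_t, z_t) pairs seen so far
    incurred = 0.0
    for _ in range(T):
        f_t = player(history)
        z_t = adversary(history, f_t)
        incurred += loss(z_t, f_t)
        history.append((f_t, z_t))
    # Loss of the best fixed decision in hindsight, searched over the grid.
    best_fixed = min(sum(loss(z, f) for _, z in history) for f in candidates)
    return incurred - best_fixed

def absolute_loss(z, f):
    return abs(z - f)

grid = np.linspace(0.0, 1.0, 11)

def constant_player(history):
    return 0.5                                    # always plays the midpoint

def alternating_adversary(history, f_t):
    return float(len(history) % 2)                # plays 0, 1, 0, 1, ...

def stubborn_adversary(history, f_t):
    return 1.0                                    # always plays 1

regret_alternating = play_oco(constant_player, alternating_adversary, absolute_loss, grid, T=4)
regret_stubborn = play_oco(constant_player, stubborn_adversary, absolute_loss, grid, T=4)
```

Against the alternating sequence every fixed decision pays the same total loss as the midpoint, so the regret is zero up to rounding; against the constant sequence the midpoint pays 1/2 per round while $f = 1$ pays nothing, so the regret after four rounds is 2.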

2 Applying von Neumann's minimax theorem 

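Define the value of the OCO game, which we also call the minimax regret, as

$$\mathcal{R}_T = \inf_{f_1 \in \mathcal{F}} \sup_{z_1 \in \mathcal{Z}} \cdots \inf_{f_T \in \mathcal{F}} \sup_{z_T \in \mathcal{Z}} \left( \sum_{t=1}^{T} \ell(z_t, f_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(z_t, f) \right). \qquad (1)$$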

The OCO game has a purely "optimization" flavor. However, applying von Neumann's minimax theorem 
shows that its value is closely related to the behavior of the empirical minimization algorithm in a stochastic 
process setting. 

Theorem 1. Under the assumptions on $\mathcal{F}$, $\mathcal{Z}$, and $\ell$ given in the previous section,

$$\mathcal{R}_T = \sup_{\mathbf{p}} \mathbb{E} \left[ \sum_{t=1}^{T} \inf_{f_t \in \mathcal{F}} \mathbb{E}\left[ \ell(Z_t, f_t) \mid Z_1^{t-1} \right] - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(Z_t, f) \right], \qquad (2)$$

where the supremum is over all joint distributions $\mathbf{p}$ on $\mathcal{Z}^T$ and the expectations are over the sequence of
random variables $(Z_1, \ldots, Z_T)$ drawn according to $\mathbf{p}$.

The proof relies on the following version of von Neumann's minimax theorem; it appears as Theorem 7.1 
in [5]. 

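Proposition 2. Let $M(x, y)$ denote a bounded real-valued function on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ are convex
sets and $\mathcal{X}$ is compact. Suppose that $M(\cdot, y)$ is convex and continuous for each fixed $y \in \mathcal{Y}$ and $M(x, \cdot)$ is
concave for each $x \in \mathcal{X}$. Then

$$\inf_{x \in \mathcal{X}} \sup_{y \in \mathcal{Y}} M(x, y) = \sup_{y \in \mathcal{Y}} \inf_{x \in \mathcal{X}} M(x, y).$$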
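A small numerical sketch may help to see why the second argument has to range over a convex set of distributions rather than over single points. The example is an assumption chosen for illustration, not taken from the paper: $x \in [0, 1]$ is the player's point, $q \in [0, 1]$ is the probability that $z = 1$ (otherwise $z = 0$), and $M(x, q) = \mathbb{E}|z - x| = q(1 - x) + (1 - q)x$, which is linear in each argument. Both sets are discretized, and the helper name `minimax_values` is hypothetical.

```python
import numpy as np

def minimax_values(M):
    """M[i, j] is the payoff when the minimizer picks row i and the maximizer column j."""
    inf_sup = M.max(axis=1).min()    # inf over x of sup over y
    sup_inf = M.min(axis=0).max()    # sup over y of inf over x
    return float(inf_sup), float(sup_inf)

xs = np.linspace(0.0, 1.0, 101)      # grid for the player's point x
qs = np.linspace(0.0, 1.0, 101)      # grid for the probability q of z = 1
M = qs[None, :] * (1.0 - xs[:, None]) + (1.0 - qs[None, :]) * xs[:, None]

# Maximizer allowed to randomize: the two orders of optimization agree.
mixed_inf_sup, mixed_sup_inf = minimax_values(M)

# Maximizer restricted to the pure choices z = 0 and z = 1 (q = 0 or q = 1):
# the set of choices is no longer convex and a gap opens up.
pure_inf_sup, pure_sup_inf = minimax_values(M[:, [0, -1]])
```

With randomization both values equal 1/2. With pure choices only, the inf-sup value is still 1/2 but the sup-inf value drops to 0, which is the reason the proof below first replaces each $z_t$ by a distribution.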

Proof of Theorem 1. For the sake of clarity, we prove the theorem for $T = 2$. The proof for $T > 2$
uses essentially the same steps while the notation is less transparent. We postpone the general proof to the
Appendix.
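We have

$$\mathcal{R}_2 = \inf_{f_1} \sup_{z_1} \inf_{f_2} \sup_{z_2} \left( \sum_{t=1}^{2} \ell(z_t, f_t) - \inf_{f} \sum_{t=1}^{2} \ell(z_t, f) \right).$$

Consider the last optimization choice $z_2$. Suppose we instead draw $z_2$ according to a distribution, and
compute the expected value of the quantity in the parentheses. Then it is clear that maximizing this expected
value over all distributions on $\mathcal{Z}$ is equivalent to maximizing over $z_2$, with the optimizing distribution
concentrated on the optimal point. Hence,

$$\mathcal{R}_2 = \inf_{f_1} \sup_{z_1} \inf_{f_2} \sup_{p_2 \in \mathcal{P}} \mathbb{E}_{z_2 \sim p_2} \left( \sum_{t=1}^{2} \ell(z_t, f_t) - \inf_{f} \sum_{t=1}^{2} \ell(z_t, f) \right). \qquad (3)$$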



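We now apply Proposition 2 to the last inf/sup pair in (3) with

$$M(f_2, p_2) = \mathbb{E}_{z_2 \sim p_2} \left[ \sum_{t=1}^{2} \ell(z_t, f_t) - \inf_{f} \sum_{t=1}^{2} \ell(z_t, f) \right],$$

which is convex in $f_2$ (by assumption) and linear in $p_2$. Moreover, the set $\mathcal{F}$ is compact, and both $\mathcal{F}$ and $\mathcal{P}$
are convex. We conclude that

$$\begin{aligned}
\mathcal{R}_2 &= \inf_{f_1} \sup_{z_1} \inf_{f_2} \sup_{p_2} \mathbb{E}_{z_2 \sim p_2} \left[ \sum_{t=1}^{2} \ell(z_t, f_t) - \inf_{f} \sum_{t=1}^{2} \ell(z_t, f) \right] \\
&= \inf_{f_1} \sup_{z_1} \sup_{p_2} \inf_{f_2} \mathbb{E}_{z_2 \sim p_2} \left[ \sum_{t=1}^{2} \ell(z_t, f_t) - \inf_{f} \sum_{t=1}^{2} \ell(z_t, f) \right] \\
&= \inf_{f_1 \in \mathcal{F}} \sup_{z_1 \in \mathcal{Z}} \sup_{p_2 \in \mathcal{P}} \left[ \ell(z_1, f_1) + \inf_{f_2} \mathbb{E}_{z \sim p_2} \ell(z, f_2) - \mathbb{E}_{z_2 \sim p_2} \inf_{f} \sum_{t=1}^{2} \ell(z_t, f) \right] \\
&= \inf_{f_1} \sup_{z_1} \left[ \ell(z_1, f_1) + \sup_{p_2} \left\{ \inf_{f_2} \mathbb{E}_{z \sim p_2} \ell(z, f_2) - \mathbb{E}_{z_2 \sim p_2} \inf_{f} \sum_{t=1}^{2} \ell(z_t, f) \right\} \right].
\end{aligned}$$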

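Now consider the supremum over $z_1$. Write

$$A(z_1, f_1) = \ell(z_1, f_1) + \sup_{p_2} \left\{ \inf_{f_2} \mathbb{E}_{z \sim p_2} \ell(z, f_2) - \mathbb{E}_{z_2 \sim p_2} \inf_{f} \sum_{t=1}^{2} \ell(z_t, f) \right\}.$$

Using the same argument as before, we have

$$\mathcal{R}_2 = \inf_{f_1} \sup_{z_1} A(z_1, f_1) = \inf_{f_1} \sup_{p_1} \mathbb{E}_{z_1 \sim p_1} A(z_1, f_1).$$

Observe that the function

$$M(f_1, p_1) = \mathbb{E}_{z_1 \sim p_1} A(z_1, f_1)$$

is convex in $f_1$ and linear in $p_1$. Appealing to Proposition 2 again, we obtain

$$\begin{aligned}
\mathcal{R}_2 &= \inf_{f_1} \sup_{p_1} \mathbb{E}_{z_1 \sim p_1} A(z_1, f_1) \\
&= \sup_{p_1} \inf_{f_1} \mathbb{E}_{z_1 \sim p_1} \left[ \ell(z_1, f_1) + \sup_{p_2} \left\{ \inf_{f_2} \mathbb{E}_{z \sim p_2} \ell(z, f_2) - \mathbb{E}_{z_2 \sim p_2} \inf_{f} \sum_{t=1}^{2} \ell(z_t, f) \right\} \right] \\
&= \sup_{p_1} \left[ \left( \inf_{f_1} \mathbb{E}_{z \sim p_1} \ell(z, f_1) \right) + \mathbb{E}_{z_1 \sim p_1} \sup_{p_2} \left\{ \inf_{f_2} \mathbb{E}_{z \sim p_2} \ell(z, f_2) - \mathbb{E}_{z_2 \sim p_2} \inf_{f} \sum_{t=1}^{2} \ell(z_t, f) \right\} \right] \\
&= \sup_{p_1} \mathbb{E}_{z_1 \sim p_1} \sup_{p_2} \left[ \left( \inf_{f_1} \mathbb{E}_{z \sim p_1} \ell(z, f_1) \right) + \inf_{f_2} \mathbb{E}_{z \sim p_2} \ell(z, f_2) - \mathbb{E}_{z_2 \sim p_2} \inf_{f} \sum_{t=1}^{2} \ell(z_t, f) \right].
\end{aligned}$$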

A key observation that makes the above argument valid is that the choice of $f_1$ does not depend on $p_2$, and
by the same token $p_2$ is not influenced by a particular choice of $f_1$. Now, it is easy to see that maximizing
over $p_1$, then averaging over $z_1 \sim p_1$, and then maximizing over $p_2(\cdot|z_1)$ is the same as maximizing over joint
distributions $\mathbf{p}$ on $(z_1, z_2)$ and averaging over $z_1$. Thus,

$$\mathcal{R}_2 = \sup_{\mathbf{p}} \mathbb{E}_{z_1 \sim p_1} \left[ \left( \inf_{f_1} \mathbb{E}_{z \sim p_1} \ell(z, f_1) \right) + \left( \inf_{f_2} \mathbb{E}_{z \sim p_2} \ell(z, f_2) \right) - \mathbb{E}_{z_2 \sim p_2} \inf_{f} \sum_{t=1}^{2} \ell(z_t, f) \right],$$

which proves the theorem for $T = 2$. □



We can think of Eq. (2) as a game where the adversary goes first. At every round he "plays" a distribution
and the player responds with a function that minimizes the conditional expectation.

We remark that we can allow the player to choose the $f_t$'s non-deterministically in the original OCO game.
In that case, the original infimum should be over distributions on $\mathcal{F}$. We then do not need convexity of $\ell$ in
the $f_t$'s in order to apply von Neumann's theorem, and the resulting expression for the value of the game is
the same.



3 First Steps 

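The present work focuses on analyzing the expression in Equation (2) for a range of different choices of $\mathcal{Z}$
and $\mathcal{F}$, as well as for various assumptions made about the loss function $\ell$. We are not only interested in
upper- and lower-bounding the value of the game $\mathcal{R}_T$, but also in determining the types of distributions $\mathbf{p}$
that maximize or almost maximize the expression in (2). To that end, define the $\mathbf{p}$-regret as

$$\mathcal{R}_T(\mathbf{p}) := \mathbb{E} \left[ \sum_{t=1}^{T} \min_{f_t \in \mathcal{F}} \mathbb{E}\left[ \ell(Z_t, f_t) \mid Z_1^{t-1} \right] - \min_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(Z_t, f) \right] \qquad (4)$$

for any joint distribution $\mathbf{p}$ of $(Z_1, \ldots, Z_T) \in \mathcal{Z}^T$. In this section we will provide an array of analytical tools
for working with $\mathcal{R}_T(\mathbf{p})$.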

3.1 Regret for IID and Product Distributions 

Let us start with a simple example. Suppose $\mathcal{Z} = [0, 1]$, $\mathcal{F} = [0, 1]$, and $\ell(z, f) = |z - f|$. For this game, we
might try various strategies $\mathbf{p}$ and compute the value $\mathcal{R}_T(\mathbf{p})$. One choice of the joint distribution $\mathbf{p}$ could
be to put mass on disjoint intervals of $\mathcal{Z}$ at each round. Suppose the conditional distribution at time $t$,
$p_t(\cdot|Z_1^{t-1})$, is uniform on $\left[\frac{t-1}{T}, \frac{t}{T}\right]$. Then,

$$f_t^* = \arg\min_{f \in \mathcal{F}} \mathbb{E}\left[ \ell(Z_t, f) \mid Z_1^{t-1} \right] = \frac{2t-1}{2T},$$

the midpoint of the interval, while the minimizer over the data $\hat{f} = \arg\min_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(Z_t, f)$ is close to $\frac{1}{2}$.
It is easy to check that $\mathcal{R}_T(\mathbf{p})$ is negative and linear in $T$.
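The claim can be checked by simulation. The sketch below assumes that the round-$t$ distribution is uniform on $[(t-1)/T, t/T]$ and uses two standard facts about the absolute loss: the midpoint of an interval of width $w$ attains expected loss $w/4$ under the uniform distribution, and a sample median minimizes the empirical absolute loss. The function name `disjoint_interval_regret` and the sample sizes are hypothetical choices.

```python
import numpy as np

def disjoint_interval_regret(T, n_samples=2000, seed=0):
    """Monte Carlo estimate of the p-regret when Z_t is uniform on [(t-1)/T, t/T]."""
    rng = np.random.default_rng(seed)
    left = np.arange(T) / T                          # left endpoints (t-1)/T
    Z = left + rng.random((n_samples, T)) / T        # each row is one sample path
    # Sum of minimal expected losses: T rounds, each contributing (1/T)/4.
    sum_min_expected = T * (1.0 / T) / 4.0
    # Minimal empirical loss: evaluate the absolute loss at a sample median.
    med = np.median(Z, axis=1, keepdims=True)
    min_empirical = np.abs(Z - med).sum(axis=1)
    return sum_min_expected - float(min_empirical.mean())

r40 = disjoint_interval_regret(40)
r80 = disjoint_interval_regret(80)
```

For even $T$ the estimate concentrates around $\frac{1}{4} - \frac{T}{4}$: the first term stays at $\frac{1}{4}$ while the best decision in hindsight pays roughly $\frac{T}{4}$, so doubling $T$ roughly doubles the (negative) value.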

This example suggests that the chosen distribution p is not optimal, as it forces the best decision in 
hindsight to be bad as compared to the intermediate decisions. The root of the problem appears in the 
disjoint nature of the support of the distributions. One might wonder whether this suggests that the optimal 
distribution should, in fact, be the same between rounds. Hence, a natural next step is to consider i.i.d. as 
well as general product distributions $\mathbf{p}$ as candidates for maximizing $\mathcal{R}_T(\mathbf{p})$.

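The following Lemma states that $\mathbf{p}$-regret is non-negative for any choice of an i.i.d. distribution.

Lemma 3. For any i.i.d. distribution $\mathbf{p}$, $\mathcal{R}_T(\mathbf{p}) \ge 0$. Hence, $\mathcal{R}_T \ge 0$.

Proof. For an i.i.d. distribution Eq. (4) becomes

$$\begin{aligned}
\frac{1}{T} \mathcal{R}_T(\mathbf{p}) &= \frac{1}{T} \sum_{t=1}^{T} \min_{f} \mathbb{E}\left[ \ell(Z_t, f) \right] - \frac{1}{T} \mathbb{E} \min_{f} \sum_{t=1}^{T} \ell(Z_t, f) \\
&= \min_{f} \mathbb{E}\left[ \ell(Z, f) \right] - \mathbb{E} \min_{f} \frac{1}{T} \sum_{t=1}^{T} \ell(Z_t, f) \\
&\ge \min_{f} \mathbb{E}\left[ \ell(Z, f) \right] - \min_{f} \mathbb{E} \frac{1}{T} \sum_{t=1}^{T} \ell(Z_t, f) \\
&= 0,
\end{aligned}$$

where the inequality is due to the fact that $\mathbb{E} \min \le \min \mathbb{E}$. □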

Observe that $\frac{1}{T}\mathcal{R}_T(\mathbf{p})$ for an i.i.d. process is the difference between the minimum expected loss and the
expectation of the empirical loss of an empirical minimizer. 
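For a concrete i.i.d. instance the $\mathbf{p}$-regret can be computed exactly. The sketch below is an illustration under assumptions not made in the text: $Z_t$ is uniform on $\{0, 1\}$, $\mathcal{F} = [0, 1]$, and the loss is $|z - f|$. Every $f$ then has expected loss $\frac{1}{2}$, and with $k$ ones in the sample the best fixed decision is an endpoint and pays $\min(k, T - k)$. The function name is hypothetical.

```python
from math import comb

def iid_bernoulli_regret(T):
    """Exact p-regret for Z_t i.i.d. uniform on {0, 1}, F = [0, 1], loss |z - f|."""
    # Every f in [0, 1] has expected loss 1/2, so the first term equals T/2.
    sum_min_expected = T / 2
    # The number of ones k is Binomial(T, 1/2); the empirical loss k(1-f) + (T-k)f
    # is linear in f, so its minimum over [0, 1] is min(k, T - k).
    expected_min_empirical = sum(comb(T, k) * min(k, T - k) for k in range(T + 1)) / 2**T
    return sum_min_expected - expected_min_empirical
```

Here `iid_bernoulli_regret(2)` evaluates to 0.5 and `iid_bernoulli_regret(4)` to 0.75; the value is non-negative for every $T$, in agreement with Lemma 3.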

With the goal of studying various types of distributions, we now define the following hierarchy: 

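$$\mathcal{R}_T^{\text{i.i.d.}} := \sup_{\mathbf{p} = p \times \ldots \times p} \mathcal{R}_T(\mathbf{p}); \qquad \mathcal{R}_T^{\text{indep.}} := \sup_{\mathbf{p} = p_1 \times \ldots \times p_T} \mathcal{R}_T(\mathbf{p}),$$

where $p, p_1, \ldots, p_T$ are arbitrary distributions on $\mathcal{Z}$. It is immediately clear that

$$0 \le \mathcal{R}_T^{\text{i.i.d.}} \le \mathcal{R}_T^{\text{indep.}} \le \mathcal{R}_T. \qquad (5)$$

We will see that, given particular assumptions on $\mathcal{F}$, $\mathcal{Z}$ and $\ell$, some of the gaps in the above hierarchy are
significant, while others are not. Before continuing, however, we need to develop some tools for analyzing
the minimax regret.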

3.2 Tools for a General Analysis 

We now introduce two new objects that help to simplify the expression in (4) as well as derive properties of
$\mathcal{R}_T(\mathbf{p})$.

Definition 4. Given sets $\mathcal{F}$, $\mathcal{Z}$, we can define the minimum expected loss functional $\Phi$ as

$$\Phi(p) := \inf_{f \in \mathcal{F}} \mathbb{E}_{Z \sim p}\left[ \ell(Z, f) \right],$$

where $p$ is some distribution on $\mathcal{Z}$.

Defining an inner product $\langle h, p \rangle = \int_{\mathcal{Z}} h(z)\, dp(z)$ for a distribution $p$, we observe that $\Phi(p) = \inf_{f \in \mathcal{F}} \langle \ell(\cdot, f), p \rangle$.
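When $\mathcal{Z}$ is finite and $\mathcal{F}$ is replaced by a finite grid, the functional is a minimum over matrix-vector products. The following sketch rests on exactly those two assumptions; the instance (three outcomes, absolute loss) and the name `phi` are hypothetical.

```python
import numpy as np

def phi(p, losses):
    """Minimum expected loss, where losses[i, j] is the loss of the i-th candidate f
    at the j-th point of Z and p is a probability vector over Z."""
    return float(np.min(losses @ p))

# Hypothetical instance: Z = {0, 1/2, 1}, F a grid on [0, 1], absolute loss.
Z = np.array([0.0, 0.5, 1.0])
F = np.linspace(0.0, 1.0, 21)
losses = np.abs(Z[None, :] - F[:, None])

p = np.array([1.0, 0.0, 0.0])     # point mass at z = 0
q = np.array([0.0, 0.0, 1.0])     # point mass at z = 1
mix = 0.5 * p + 0.5 * q           # equal mixture of the two
```

Each point mass can be predicted perfectly, so `phi(p)` and `phi(q)` are 0, while the mixture has minimum expected loss 1/2. The value at the mixture exceeds the average of the values at the endpoints, a first instance of the concavity discussed next.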

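Definition 5. For any $Z_1, \ldots, Z_T \in \mathcal{Z}$, we denote by $\hat{P}_T(\cdot) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}_{Z_t}(\cdot)$ the empirical distribution.

With this additional notation, we can rewrite (4) as

$$\frac{1}{T} \mathcal{R}_T(\mathbf{p}) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\, \Phi\left( p_t(\cdot|Z_1^{t-1}) \right) - \mathbb{E}\, \Phi(\hat{P}_T). \qquad (6)$$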

Thus, the Adversary's task is to induce a large deviation between the average sequence of conditional
distributions $\{p_t(\cdot|Z_1^{t-1})\}$ and an empirical sample $\hat{P}_T$ from these conditionals, where the deviation is
defined by way of the functional $\Phi$. It is easy to check that $\Phi$ and $\mathcal{R}_T$ are concave, as the next lemma shows.
The proof is postponed to the Appendix.

Lemma 6. The functional $\Phi(\cdot)$ is concave on the space of distributions over $\mathcal{Z}$ and $\mathcal{R}_T(\cdot)$ is concave with
respect to joint distributions on $\mathcal{Z}^T$.

It is indeed concavity of $\Phi$ that is key to understanding the behavior of $\mathcal{R}_T$. A hint of this can already
be seen in the proof of Lemma 3, where the only inequality is due to the concavity of the min. In the next
section, we show how this description of regret can be interpreted through a Bregman divergence in terms
of $\Phi$.



3.3 Divergences and the Gap in Jensen's Inequality 

We now show how to interpret regret through the lens of Jensen's Inequality by providing yet another
expression for it, now in terms of Bregman Divergences. We begin by revisiting the i.i.d. case $\mathbf{p} = p^T =
p \times \ldots \times p$, for some distribution $p$ on $\mathcal{Z}$. Equation (6) simplifies to a very natural quantity,

$$\frac{1}{T} \mathcal{R}_T(p^T) = \Phi(p) - \mathbb{E}\, \Phi(\hat{P}_T). \qquad (7)$$

Notice that $\hat{P}_T$ is a random quantity, and in particular that $\mathbb{E} \hat{P}_T = p$. As $\Phi(\cdot)$ is concave, with an
immediate application of Jensen's Inequality we obtain $\mathcal{R}_T(p^T) \ge 0$. For arbitrary joint distributions $\mathbf{p}$, we
can similarly interpret regret as a "gap" in Jensen's Inequality, albeit with some added complexity.

Definition 7. If $F$ is any convex differentiable^1 functional on the space of distributions on $\mathcal{Z}$, we define the
Bregman divergence with respect to $F$ as

$$\mathcal{D}_F(q, p) = F(q) - F(p) - \langle \nabla F(p), q - p \rangle.$$

If $F$ is non-differentiable, we can take a particular subgradient $v_p \in \partial F(p)$ in place of $\nabla F(p)$. Note that
the notion of subgradients is well-defined even for infinite-dimensional convex functions. Having chosen^2 a
mapping $p \mapsto v_p \in \partial F(p)$, we define a generalized divergence with respect to $F$ and $v_p$ as

$$\mathcal{D}_F(q, p) = F(q) - F(p) - \langle v_p, q - p \rangle.$$

Throughout the paper, we focus only on the divergence $\mathcal{D}_{-\Phi}$ and thus we omit $-\Phi$ from the notation
for simplicity.

Given the definition of divergence, it immediately follows that, for a random distribution $q$,

$$\Phi(\mathbb{E} q) - \mathbb{E}\, \Phi(q) = \mathbb{E}\, \mathcal{D}(q, \mathbb{E} q),$$

since the linear term disappears under the expectation. This simple observation is quite useful; notice we
now have an even simpler expression for i.i.d. regret (7):

$$\frac{1}{T} \mathcal{R}_T(p^T) = \mathbb{E}\, \mathcal{D}(\hat{P}_T, p).$$

In other words, the $p^T$-regret is equal to the expected divergence between the empirical distribution and its
expectation. This will be a starting point for obtaining lower bounds for $\mathcal{R}_T$. For general joint distributions
$\mathbf{p}$, let us rewrite the expression in (6) as

$$\mathbb{E}_{t \sim U} \mathbb{E}\, \Phi\left( p_t(\cdot|Z_1^{t-1}) \right) - \mathbb{E}\, \Phi(\hat{P}_T),$$

where we replaced the average with a uniform distribution on the rounds. Roughly speaking, the next
lemma says that one can obtain $\mathbb{E}\, \Phi(\hat{P}_T)$ from $\mathbb{E}_{t \sim U} \mathbb{E}\, \Phi(p_t(\cdot|Z_1^{t-1}))$ through three applications of Jensen's
inequality, due to various expectations being "pulled" inside or outside of $\Phi$.

^1 Here, we mean differentiable with respect to the Fréchet or Gâteaux derivative. We refer the reader to [7] for precise
definitions of functional Bregman Divergences.

^2 The assumption of compactness of $\mathcal{F}$, together with the characterization of the subgradient set in Section 4.2, allow us, for
instance, to define the mapping $p \mapsto v_p$ by putting a uniform measure on the subgradient set and defining $v_p$ to be the expected
subgradient with respect to it. In fact, the choice of the mapping is not important, as long as it does not depend on $q$.
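The identity between the Jensen gap and the expected divergence can be verified exactly on a small finite example. The sketch below makes several assumptions for illustration: $\mathcal{Z}$ has three points, $\mathcal{F}$ is a finite grid, the loss is the absolute loss, and the subgradient of $-\Phi$ at $p$ is taken to be $-\ell(\cdot, f_p)$ for a minimizer $f_p$, in which case the divergence reduces to the excess loss $\mathbb{E}_q \ell(Z, f_p) - \Phi(q)$. All function names are hypothetical.

```python
import itertools
import numpy as np

def phi_and_minimizer(p, losses):
    """Return (Phi(p), index of one minimizing f); losses[i, j] = loss of f_i at z_j."""
    expected = losses @ p
    i = int(np.argmin(expected))
    return float(expected[i]), i

def divergence(q, p, losses):
    """Generalized divergence for -Phi with subgradient -l(., f_p): excess loss of f_p under q."""
    _, i_p = phi_and_minimizer(p, losses)
    phi_q, _ = phi_and_minimizer(q, losses)
    return float(losses[i_p] @ q) - phi_q

def iid_jensen_gap(p, losses, T):
    """Return (Phi(p) - E Phi(P_hat_T), E D(P_hat_T, p)) by enumerating all samples of size T."""
    d = len(p)
    phi_p, _ = phi_and_minimizer(p, losses)
    expected_phi_emp = 0.0
    expected_div = 0.0
    for seq in itertools.product(range(d), repeat=T):
        prob = float(np.prod(p[list(seq)]))              # probability of this sample
        emp = np.bincount(seq, minlength=d) / T          # empirical distribution
        expected_phi_emp += prob * phi_and_minimizer(emp, losses)[0]
        expected_div += prob * divergence(emp, p, losses)
    return phi_p - expected_phi_emp, expected_div

Z = np.array([0.0, 0.5, 1.0])
F = np.linspace(0.0, 1.0, 21)
losses = np.abs(Z[None, :] - F[:, None])
p = np.array([0.2, 0.3, 0.5])
gap, expected_div = iid_jensen_gap(p, losses, T=3)
```

The two returned numbers agree up to rounding, and both are non-negative, since the linear term $\langle \ell(\cdot, f_p), \hat{P}_T - p \rangle$ has zero mean.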

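Lemma 8. Suppose $\mathbf{p}$ is an arbitrary joint distribution. Denote by $p_t(\cdot|Z_1^{t-1})$ and $p_t^m$ the conditional and
marginal distributions, respectively. Then

$$\frac{1}{T} \mathcal{R}_T(\mathbf{p}) = \Delta_2 - \Delta_1 - \Delta_0, \qquad (8)$$

where

$$\Delta_0 = \mathbb{E}_{t \sim U}\, \mathcal{D}\left( p_t^m, \mathbb{E}_{s \sim U}\, p_s^m \right), \qquad
\Delta_1 = \mathbb{E}_{t \sim U} \mathbb{E}\, \mathcal{D}\left( p_t(\cdot|Z_1^{t-1}), p_t^m \right), \qquad
\Delta_2 = \mathbb{E}\, \mathcal{D}\left( \hat{P}_T, \mathbb{E}_{s \sim U}\, p_s^m \right),$$

and $p_t^m = \mathbb{E}\left[ p_t(\cdot|Z_1^{t-1}) \right]$ is the marginal distribution at time $t$.

Proof. The marginal distribution satisfies $\mathbb{E}\, p_t(\cdot|Z_1^{t-1}) = p_t^m$, and it is easy to see that $\mathbb{E} \hat{P}_T = \mathbb{E}_{t \sim U}\, p_t^m$.
Given this, we see that

$$\begin{aligned}
\frac{1}{T} \mathcal{R}_T(\mathbf{p}) &= \mathbb{E}_{t \sim U} \mathbb{E}\, \Phi\left( p_t(\cdot|Z_1^{t-1}) \right) - \mathbb{E}\, \Phi(\hat{P}_T) \\
&= \underbrace{\mathbb{E}_{t \sim U} \left[ \mathbb{E}\, \Phi\left( p_t(\cdot|Z_1^{t-1}) \right) - \Phi(p_t^m) \right]}_{-\Delta_1}
 + \underbrace{\mathbb{E}_{t \sim U}\, \Phi(p_t^m) - \Phi\left( \mathbb{E}_{t \sim U}\, p_t^m \right)}_{-\Delta_0}
 + \underbrace{\Phi\left( \mathbb{E} \hat{P}_T \right) - \mathbb{E}\, \Phi(\hat{P}_T)}_{\Delta_2},
\end{aligned}$$

where each of the three differences is identified with a divergence through the identity
$\Phi(\mathbb{E} q) - \mathbb{E}\, \Phi(q) = \mathbb{E}\, \mathcal{D}(q, \mathbb{E} q)$. □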



This lemma sheds some light on the influence of an i.i.d. vs. product vs. arbitrary joint distribution on
the regret. For product distributions, every conditional distribution is identical to its marginal distribution,
thus implying $\Delta_1 = 0$. Furthermore, for any i.i.d. distribution, each marginal distribution is identical to
the average marginal, thus implying that $\Delta_0 = 0$. With this in mind, it is tempting to assert that the
largest regret is obtained at an i.i.d. distribution, since transitions from i.i.d. to product, and from product
to arbitrary distribution, only subtract from the regret value. While appealing, this is unfortunately not
the case: in many instances the final term, $\Delta_2$, can be made larger with a non-i.i.d. (and even non-product)
distribution, even at the added cost of positive $\Delta_0$ and $\Delta_1$ terms, so that $\mathcal{R}_T^{\text{i.i.d.}} = o(\mathcal{R}_T)$ as a function of $T$.

In some cases, however, we show that a lower bound on the regret can be obtained with an i.i.d. distribution
at a cost of only a constant factor.

4 Properties of $\Phi$

In statistical learning, the rate of decay of prediction error is known to depend on the curvature of the loss:
more curvature leads to faster rates (see, for example, [lU UHl [5]), and slow (e.g. $\Omega(T^{-1/2})$) rates occur when
the loss is not strictly convex, or when the minimizer of the expected loss is not unique [12, 14]. There is a
striking parallel with the behavior of the regret in online convex optimization; again the curvature of the loss
plays a central role. Roughly speaking, if $\ell$ is strongly convex or exp-concave, second-order gradient-descent
methods ensure that the regret grows no faster than $\log T$ (e.g. [8]); if $\ell$ is linear, the regret can grow no
faster than $\sqrt{T}$ (e.g. [H]); intermediate rates can be achieved as well if the curvature varies [5].

The previous section expresses regret as a sum of divergences under $\Phi$, and that suggests that the
curvature of $\Phi$ should be an important factor in determining the rates of regret. We shall see that this is
the case: curvature of $\Phi$ leads to large regret, while flatness of $\Phi$ implies small regret.

We will now show how properties of the loss function class determine the curvature of $\Phi$. In later sections
we will show how such curvature properties lead directly to particular rates for $\mathcal{R}_T$. First, let us provide a
fruitful geometric picture, rooted in convex analysis. It allows us to see the function $\Phi$, roughly speaking,
as a mirror image of the function class.

4.1 Geometric interpretation of $\Phi$

In general, the set Z is uncountable, so care must be taken with regard to various notions we are about 
to introduce. We refer the reader to Chapter 10 of [4] for the discussion of finite vs infinite-dimensional
spaces in convex analysis. Since Z is compact by assumption, we can discretize it to a fine enough level 
such that the upper and lower bounds of this paper hold, as long as the results are non-asymptotic. In the 
present Section, for simplicity of exposition, we will suppose that the set Z is finite with cardinality d. This 
assumption is required only for the geometric interpretation; our proofs are correct as long as Z is compact. 

Hence, distributions over the set $\mathcal{Z}$ are associated with $d$-dimensional vectors. Furthermore, each $f \in \mathcal{F}$
is specified by its $d$ values on the points. We write $\ell_f \in \mathbb{R}^d$ for the loss vector of $f$, $\ell(\cdot, f)$. Let us denote
the set of all such vectors by $\ell(\mathcal{F})$. We then have

$$-\Phi(p) = -\inf_{f \in \mathcal{F}} \mathbb{E}_p\, \ell(Z, f) = \sup_{f \in \mathcal{F}} \langle -\ell_f, p \rangle = \sigma_{-\ell(\mathcal{F})}(p),$$

where $\sigma_S(x) = \sup_{s \in S} \langle s, x \rangle$ is the support function for the set $S$. This function is one of the most basic
objects of convex analysis (see, for instance, [S]). It is well-known that $\sigma_S = \sigma_{\mathrm{co}\, S}$; in other words, the
support function does not change with respect to taking the convex hull (see Proposition 2.2.1, page 137, [S]).
To this end, let us denote $S = \mathrm{co}[-\ell(\mathcal{F})] \subset \mathbb{R}^d$.

It is known that the support function is sublinear and its epigraph is a cone. To visualize the support
function, consider the $\mathbb{R}^d \times \mathbb{R}$ space. Embed the set $S \subset \mathbb{R}^d$ in $\mathbb{R}^d \times \{1\}$ (see Figure 1). Then construct the
conic hull of $S \times \{1\}$. It turns out that the cone which is dual to the constructed conic hull is the epigraph
of the support function $\sigma_S$. The dual cone is the set of vectors which form obtuse or right angles with all
the vectors in the original cone. Hence, one can visualize the surface $\sigma_S$ as being at right angles to the
conic hull of $S \times \{1\}$. Now, the function $\Phi$ is just the restriction of $\sigma_S$ to the simplex. We can now deduce
properties of $\Phi$ from properties of the loss class.
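The relation between $\Phi$ and the support function, and the invariance under taking convex hulls, can be checked numerically in the finite setting of this section. The sketch below assumes a hypothetical loss class (absolute loss, three outcomes, five candidate decisions) and approximates the convex hull by adding random convex combinations of the loss vectors; the name `support` is also an assumption.

```python
import numpy as np

def support(S, x):
    """Support function of a finite set: the largest inner product <s, x> over the rows s of S."""
    return float(np.max(S @ x))

rng = np.random.default_rng(0)

Z = np.array([0.0, 0.5, 1.0])
F = np.linspace(0.0, 1.0, 5)
loss_vectors = np.abs(Z[None, :] - F[:, None])       # rows are the loss vectors l_f
S = -loss_vectors                                    # the set -l(F)

# Random convex combinations of the rows of S are points of co(S).
weights = rng.dirichlet(np.ones(len(S)), size=200)
S_hull = np.vstack([S, weights @ S])

p = np.array([0.2, 0.3, 0.5])
phi_p = float(np.min(loss_vectors @ p))              # minimum expected loss at p
```

Evaluating `support(S, p)` gives `-phi_p`, and `support(S_hull, p)` returns the same number: a linear functional cannot be made larger by averaging points over which it has already been maximized.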
Figure 1: Dual cone as the epigraph of the support function. $\Phi$ is the restriction to the simplex.

4.2 Differentiability of $\Phi$

Lemma 9. The subdifferential set of $\Phi$ is the set of expected minimizers:

$$\partial \Phi(p) = \left\{ \ell_f : f \in \arg\min_{f \in \mathcal{F}} \mathbb{E}_p\, \ell(Z, f) \right\}.$$

Hence, the functional $\Phi$ is differentiable at a distribution $p$ iff $\arg\min_{f \in \mathcal{F}} \mathbb{E}_p\, \ell(Z, f)$ is unique.

Proof. We have seen that $-\Phi$ is the support function of $\mathrm{co}[-\ell(\mathcal{F})]$ restricted to the probability simplex.
The subdifferential set of the support function is the support set, that is, the set of points achieving the
supremum in the definition of support function.

By examining Figure 1 one can see why this statement is correct: roughly speaking, a point on the
boundary of $S$, which supports some distribution $p$, serves as a normal to a hyperplane tangent to $\Phi$ at $p$.
The precise proof of this fact, found in Proposition 2.1.5 in [S], can be extended to the infinite-dimensional
case as well.

For $\Phi$ to be differentiable, the subdifferential set has to be a singleton. This immediately gives us the
criterion for the differentiability of $\Phi$ stated above. □

In particular, for $\Phi$ to be differentiable for all distributions, the loss function class should not have a "face"
exposed to the origin. This geometrical picture and its implications will be studied further in Section 6.

It is easy to verify that strict convexity of $\ell(z, f)$ in $f$ implies uniqueness of the minimizer for any $p$ and,
hence, differentiability of $\Phi$.



4.3 Flatness of $\Phi$ through curvature of $\ell$

In this section we show that curvature in the loss function leads to flatness of $\Phi$. We would indeed expect
such a result to hold since regret decaying faster than $O(T^{-1/2})$ is known to occur in the case of curved losses
(e.g. [2]), and decomposition (8) suggests that this should imply flatness of $\Phi$. More precisely, we show that
if $\ell(z, f)$ is strongly convex in $f$ with respect to some norm $\|\cdot\|$, then $\Phi$ is strongly flat with respect to the
$\ell_1$ norm on the space of distributions. Before stating the main result, we provide several definitions.

Definition 10. A convex function $F$ is $\alpha$-flat (or $\alpha$-smooth) with respect to a norm $\|\cdot\|$ when

$$F(y) - F(x) \le \langle \nabla F(x), y - x \rangle + \alpha \|x - y\|^2 \qquad (9)$$

for all $x, y$. We will say that a concave function $G$ is $\alpha$-flat if $-G$ satisfies (9).

Let us also recall the definition of the $\ell_1$ (or variational) norm on distributions.
Figure 2: The two terms in the decomposition (10).



Definition 11. For two distributions $p, q$ on $\mathcal{Z}$, we define

$$\|p - q\|_1 = \int_{\mathcal{Z}} |dp(z) - dq(z)|.$$

Theorem 12. Suppose $\ell(z, f)$ is $\sigma$-strongly convex in $f$, that is,

$$\ell\left( z, \frac{f + g}{2} \right) \le \frac{\ell(z, f) + \ell(z, g)}{2} - \frac{\sigma}{8} \|f - g\|^2$$

for any $z \in \mathcal{Z}$ and $f, g \in \mathcal{F}$. Suppose further that $\ell$ is $L$-Lipschitz, that is,

$$|\ell(z, f) - \ell(z, g)| \le L \|f - g\|.$$

Under these conditions, the $\Phi$-functional is $\frac{2L^2}{\sigma}$-flat with respect to $\|\cdot\|_1$.

The proof uses the following lemma, which shows stability of the minimizers. Its proof appears in the
Appendix.

Lemma 13. Fix two distributions $p, q$. Let $f_p$ and $f_q$ be the functions achieving the minimum in $\Phi(p)$ and
$\Phi(q)$, respectively. Under the conditions of Theorem 12,

$$\|f_p - f_q\| \le \frac{2L}{\sigma} \|p - q\|_1.$$

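Proof of Theorem 12. We have

$$\Phi(p) - \Phi(q) = \mathbb{E}_p \ell(z, f_p) - \mathbb{E}_q \ell(z, f_q) = \left( \mathbb{E}_p \ell(z, f_p) - \mathbb{E}_q \ell(z, f_p) \right) + \left( \mathbb{E}_q \ell(z, f_p) - \mathbb{E}_q \ell(z, f_q) \right). \qquad (10)$$

Let us first study the second term in the expression above. As $f_p$ is the minimizer of $\mathbb{E}_p \ell(z, f)$, we have:

$$\mathbb{E}_p \left[ \ell(z, f_p) - \ell(z, f_q) \right] \le 0.$$

So

$$\begin{aligned}
\mathbb{E}_q \left[ \ell(z, f_p) - \ell(z, f_q) \right] &\le \mathbb{E}_q \left[ \ell(z, f_p) - \ell(z, f_q) \right] - \mathbb{E}_p \left[ \ell(z, f_p) - \ell(z, f_q) \right] \\
&= \int_{\mathcal{Z}} \left( \ell(z, f_p) - \ell(z, f_q) \right) \left( dq(z) - dp(z) \right) \\
&\le L \int_{\mathcal{Z}} \|f_p - f_q\| \, |dp(z) - dq(z)|.
\end{aligned}$$

Using Lemma 13 we get:

$$\mathbb{E}_q \left[ \ell(z, f_p) - \ell(z, f_q) \right] \le \frac{2L^2}{\sigma} \|p - q\|_1^2. \qquad (11)$$

As for the first term in (10),

$$\mathbb{E}_p \ell(z, f_p) - \mathbb{E}_q \ell(z, f_p) = \int_{\mathcal{Z}} \ell(z, f_p) \left( dp(z) - dq(z) \right) = \langle \ell(\cdot, f_p), p - q \rangle. \qquad (12)$$

The fact that $\ell(\cdot, f_p)$ is a subdifferential of $\Phi$ at $p$ is proved in the appendix. We conclude that the first and
the second terms in (10) are the first and the second order terms in the expansion of $\Phi$. □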



We remark that we can arrive at the above results by explicitly considering the dual function $\Phi^*$, proving
strong convexity of $\Phi^*$ with respect to $\|\cdot\|_\infty$ (which follows from our assumption on $\ell$), and then concluding
strong flatness of $\Phi$ with respect to $\|\cdot\|_1$. This is indeed the main intuition at the heart of our proof.
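For the squared loss the two ingredients of the argument, stability of the minimizer and a quadratic bound on the second term of the decomposition, can be seen directly. The sketch below is an illustration under assumptions that do not come from the text: $\mathcal{Z}$ is a finite grid in $[0, 1]$, $\mathcal{F} = [0, 1]$, and $\ell(z, f) = (z - f)^2$, so that the minimizer is the mean. The constants $\frac{1}{2}$ and $\frac{1}{4}$ checked here are specific to this example and are not the constants of the general statements.

```python
import numpy as np

rng = np.random.default_rng(1)
Z = np.linspace(0.0, 1.0, 6)                 # hypothetical finite outcome set in [0, 1]

def minimizer(p):
    # For squared loss the minimizer of the expected loss over F = [0, 1] is the mean.
    return float(Z @ p)

def expected_loss(p, f):
    return float(((Z - f) ** 2) @ p)

def check_pair(p, q):
    """Return (|f_p - f_q|, excess loss under q of playing f_p, ||p - q||_1)."""
    f_p, f_q = minimizer(p), minimizer(q)
    l1 = float(np.abs(p - q).sum())
    second_term = expected_loss(q, f_p) - expected_loss(q, f_q)
    return abs(f_p - f_q), second_term, l1

# Random pairs of distributions on Z.
pairs = rng.dirichlet(np.ones(len(Z)), size=(100, 2))
results = [check_pair(p, q) for p, q in pairs]
```

In every pair the minimizers differ by at most $\frac{1}{2}\|p - q\|_1$, and the excess loss equals $(f_p - f_q)^2$, hence is at most $\frac{1}{4}\|p - q\|_1^2$: a perturbation of the distribution of size $\epsilon$ in variational norm changes the second-order term only by order $\epsilon^2$.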



5 Upper Bounds on $\mathcal{R}_T$

In this section, we exhibit two general upper bounds on $\mathcal{R}_T$ that hold for a wide class of OCO games. The first
bound, which holds when the functional $\Phi$ is differentiable and not too curved, is of the form $\mathcal{R}_T = O(\log T)$.
The second, which holds for arbitrary $\Phi$, e.g. where the functional may even have a non-differentiability, is
stated in terms of the Rademacher complexity of the class $\mathcal{F}$. Such Rademacher complexity results imply a
regret upper bound on the order of $\sqrt{T}$.

An intriguing observation is that these bounds are proved without actually exhibiting a strategy for the
Player, as is typically done. This illustrates the power of the minimax duality approach: we can prove the
existence of an optimal algorithm, and determine its performance, all without providing its construction.

Throughout, we shall refer to $O(\sqrt{T})$ rates as "slow rates" and $O(\log T)$ as "fast rates". These notions are
borrowed from the statistical learning literature, where fast rates of convergence of the empirical minimizer 
to the best in class arise from certain assumptions, such as convexity of the class and square loss. The 
slow rates, on the other hand, are exhibited by the situations where the expected minimizer of the loss is 
non-unique. 



5.1 Fast Rates: Exploiting the Curvature 

For differentiable $\Phi$ with bounded second derivative, we can prove that the regret grows no faster than
logarithmically in T. Of course, rates of logT have been given previously [U [THJ ^7\. We build upon these 
results in the present work by showing that logarithmic regret must always arise when $\Phi$ satisfies a flatness
condition.

Theorem 14. Suppose the $\Phi$ functional is differentiable and $\alpha$-flat with respect to the norm $\|\cdot\|_1$ on $\mathcal{P}$. Then
$\mathcal{R}_T \le 4\alpha \log T$.

We immediately obtain the following corollary.

Corollary 15. Suppose functions $\ell(z, f)$ are $\sigma$-strongly convex and $L$-Lipschitz in $f$. Then $\mathcal{R}_T \le \frac{8L^2}{\sigma} \log T$.

Furthermore, as we show in Section 7.3, the $\log T$ bound is tight for quadratic functions; there is an
explicit joint distribution for the adversary which attains this value.

The proof of Theorem 14 involves the following lemma.



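Lemma 16. The $\mathbf{p}$-regret can be upper-bounded as

$$\mathcal{R}_T(\mathbf{p}) \le \mathbb{E} \left[ \sum_{t=1}^{T} t \cdot \mathcal{D}\left( \hat{P}_t, \bar{P}_t \right) \right],$$

where $\bar{P}_t(\cdot) = \frac{t-1}{t} \hat{P}_{t-1}(\cdot) + \frac{1}{t}\, p_t(\cdot|Z_1^{t-1})$.

Proof. Consider the following difference:

$$\delta_t := \frac{1}{t} \mathbb{E}\, \Phi\left( p_t(\cdot|Z_1^{t-1}) \right) - \mathbb{E}\, \Phi(\hat{P}_t)
= \left[ \frac{1}{t} \mathbb{E}\, \Phi\left( p_t(\cdot|Z_1^{t-1}) \right) - \mathbb{E}\, \Phi(\bar{P}_t) \right] + \left[ \mathbb{E}\, \Phi(\bar{P}_t) - \mathbb{E}\, \Phi(\hat{P}_t) \right].$$

For the first difference we use concavity of $\Phi$. The second difference can be written as a divergence because
the linear term vanishes in expectation. Indeed,

$$\mathbb{E} \left\langle \nabla \Phi(\bar{P}_t), \frac{1}{t} \left( \mathbb{1}_{Z_t}(\cdot) - p_t(\cdot|Z_1^{t-1}) \right) \right\rangle = 0$$

because the gradient does not depend on $Z_t$, while

$$\mathbb{E}_{Z_t} \left[ \mathbb{1}_{Z_t}(\cdot) \mid Z_1^{t-1} \right] = p_t(\cdot|Z_1^{t-1}).$$

Hence,

$$\delta_t \le -\frac{t-1}{t} \mathbb{E}\, \Phi(\hat{P}_{t-1}) + \mathbb{E}\, \mathcal{D}\left( \hat{P}_t, \bar{P}_t \right),$$

and so

$$\begin{aligned}
\mathcal{R}_T(\mathbf{p}) &= \sum_{t=1}^{T} \mathbb{E}\, \Phi\left( p_t(\cdot|Z_1^{t-1}) \right) - T\, \mathbb{E}\, \Phi(\hat{P}_T) \\
&= \sum_{t=1}^{T-1} \mathbb{E}\, \Phi\left( p_t(\cdot|Z_1^{t-1}) \right) + T \delta_T \\
&\le \sum_{t=1}^{T-1} \mathbb{E}\, \Phi\left( p_t(\cdot|Z_1^{t-1}) \right) - (T-1)\, \mathbb{E}\, \Phi(\hat{P}_{T-1}) + T\, \mathbb{E}\, \mathcal{D}\left( \hat{P}_T, \bar{P}_T \right).
\end{aligned}$$

The first two terms have the same form as $\mathcal{R}_{T-1}(\mathbf{p})$, so repeating the argument for $T-1, \ldots, 1$ gives the statement. □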

Before proceeding, note that we may interpret $\bar{P}_t$ as the conditional expectation of the uniform distribution
$\hat{P}_t$ given $Z_1, \ldots, Z_{t-1}$. The flatness of $\Phi$ will allow us to show that $\hat{P}_t$ deviates very slightly from $\bar{P}_t$ in
expectation: indeed, by no more than $O(\frac{1}{t^2})$. This is crucial for obtaining fast rates: for general $\Phi$ (which
may be non-differentiable), it is natural to expect $\mathcal{D}(\hat{P}_t, \bar{P}_t) = \Omega(1/t)$. In this case, the regret would be
bounded by $O(\sum_t t \cdot 1/t) = O(T)$, rendering the above lemma useless.

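Proof of Theorem 14. We have that the divergence terms in Lemma 16 are bounded as

$$t \cdot \mathcal{D}\left( \hat{P}_t, \bar{P}_t \right) \le t \alpha \left\| \frac{1}{t} \mathbb{1}_{Z_t}(\cdot) - \frac{1}{t}\, p_t(\cdot|Z_1^{t-1}) \right\|_1^2 \le \frac{4\alpha}{t}$$

because the variational distance between distributions is bounded by 2, so that its square is bounded by 4:

$$\left( \int_{\mathcal{Z}} \left| \delta_{Z_t}(z) - dp_t(z|Z_1^{t-1}) \right| \right)^2 \le 4.$$

□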
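The logarithmic rate comes from summing the per-round bounds. As a small sketch, assuming the per-round bound $4\alpha / t$ from the display above, the sum is a multiple of a harmonic number, which grows like $\log T$; the function name `fast_rate_bound` is hypothetical.

```python
import math

def fast_rate_bound(alpha, T):
    """Sum over t = 1, ..., T of the per-round bound 4 * alpha / t."""
    return sum(4.0 * alpha / t for t in range(1, T + 1))

def harmonic_envelope(alpha, T):
    """Standard upper envelope 4 * alpha * (1 + ln T) for the same sum."""
    return 4.0 * alpha * (1.0 + math.log(T))
```

For every $T \ge 1$ the sum stays below the envelope $4\alpha(1 + \ln T)$, so doubling the horizon adds only about $4\alpha \ln 2$ to the bound.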
5.2 General $\sqrt{T}$ Upper Bounds



We start with the definition of Rademacher averages, one of the central notions of complexity of a function 
class. 



Definition 17. Denote by

$$\mathrm{Rad}_T(\ell(\mathcal{F})) = \mathbb{E}_{\epsilon} \sup_{f \in \mathcal{F}} \frac{1}{\sqrt{T}} \sum_{t=1}^{T} \epsilon_t\, \ell(Z_t, f)$$

the data-dependent Rademacher averages of the class $\ell(\mathcal{F})$. Here, $\epsilon_1, \ldots, \epsilon_T$ are independent Rademacher
random variables (uniform on $\{\pm 1\}$).

We will omit the subscript $T$ and dependence on $Z_1, \ldots, Z_T$, for the sake of simplicity. In statistical learning
theory, Rademacher averages often provide the tightest guarantees on the performance of empirical risk
minimization and other methods. The next result shows that the Rademacher averages play a key role in
online convex optimization as well, as the minimax regret is upper bounded by the worst-case (over the
sample) Rademacher averages. In the next section, we will also show lower bounds in terms of Rademacher
averages for certain linear games, showing that this notion of complexity is fundamental for OCO.

Theorem 18.

$$\mathcal{R}_T \le 2\sqrt{T} \sup_{Z_1, \ldots, Z_T} \mathrm{Rad}(\ell(\mathcal{F})).$$
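For a finite class and a short sequence, the data-dependent Rademacher average can be computed exactly by enumerating all sign patterns. The sketch below assumes the $1/\sqrt{T}$ normalization (chosen to be consistent with the factor $2\sqrt{T}$ in Theorem 18) and a finite class given as a matrix of losses; the name `rademacher_average` is hypothetical.

```python
import itertools
import numpy as np

def rademacher_average(loss_matrix):
    """Exact E_eps sup_f (1/sqrt(T)) * sum_t eps_t * loss_matrix[f, t].

    loss_matrix[i, t] is the loss of the i-th function on the t-th sample point.
    All 2^T sign patterns are enumerated, so this is only practical for small T.
    """
    _, T = loss_matrix.shape
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=T):
        total += float(np.max(loss_matrix @ np.array(signs)))
    return total / (2 ** T) / np.sqrt(T)

# Hypothetical example: absolute loss, sample (z_1, z_2) = (0, 1), decisions f = 0 and f = 1.
two_functions = np.array([[0.0, 1.0],
                          [1.0, 0.0]])
single_function = np.array([[1.0, 1.0, 1.0]])
```

A single function cannot correlate with random signs, so its average is 0. With the two decisions above, the supremum picks the larger of $\epsilon_1$ and $\epsilon_2$, whose expectation is $\frac{1}{2}$, giving $\frac{1}{2\sqrt{2}}$ after normalization.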

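Proof. Let $\mathbf{p}$ be an arbitrary joint distribution. Let $\hat{f}$ be an empirical minimizer over $Z_1^T$, a sequence-dependent
function. Then

$$\frac{1}{T} \mathcal{R}_T(\mathbf{p}) \le \mathbb{E} \left[ \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{Z \sim p_t(\cdot|Z_1^{t-1})}\, \ell(Z, \hat{f}) - \frac{1}{T} \sum_{t=1}^{T} \ell(Z_t, \hat{f}) \right],$$

as the particular choice of $\hat{f}$ is (sub)optimal. Replacing $\hat{f}$ by the supremum over $\mathcal{F}$,

$$\begin{aligned}
\frac{1}{T} \mathcal{R}_T(\mathbf{p}) &\le \mathbb{E}\, \frac{1}{T} \sum_{t=1}^{T} \left[ \mathbb{E}_{p_t(\cdot|Z_1^{t-1})}\, \ell(Z, \hat{f}) - \ell(Z_t, \hat{f}) \right] \\
&\le \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^{T} \left[ \mathbb{E}_{p_t(\cdot|Z_1^{t-1})}\, \ell(Z, f) - \ell(Z_t, f) \right] \\
&\le \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^{T} \left[ \ell(Z_t', f) - \ell(Z_t, f) \right],
\end{aligned}$$

where we renamed each dummy variable $Z$ as $Z_t'$. Even though $Z_t$ and $Z_t'$ have the same conditional
expectation, we cannot generally exchange them keeping the distribution of the whole quantity intact.
Indeed, the conditional distributions for $\tau > t$ will depend on $Z_t$ and not on $Z_t'$. The trick is to exchange
them one by one^3, starting from $t = T$ and going backwards, introducing an additional supremum. (One
can view the sequence $\{Z_t'\}$ as being tangent to $\{Z_t\}$ (see [5]).) To this end, for any fixed $\epsilon_T \in \{-1, +1\}$,

$$\mathbb{E} \sup_{f} \frac{1}{T} \sum_{t=1}^{T} \left[ \ell(Z_t', f) - \ell(Z_t, f) \right]
= \mathbb{E} \sup_{f} \frac{1}{T} \left( \sum_{t=1}^{T-1} \left[ \ell(Z_t', f) - \ell(Z_t, f) \right] + \epsilon_T \left( \ell(Z_T', f) - \ell(Z_T, f) \right) \right)$$

because for the last step, indeed, $Z_T$ and $Z_T'$ can be exchanged. Since this holds for any $\epsilon_T$, we can take it
to be a Rademacher random variable. Thus,

$$\begin{aligned}
&\mathbb{E} \sup_{f} \frac{1}{T} \left( \sum_{t=1}^{T-1} \left[ \ell(Z_t', f) - \ell(Z_t, f) \right] + \left( \ell(Z_T', f) - \ell(Z_T, f) \right) \right) \\
&\qquad \le \mathbb{E}_{\epsilon_T} \mathbb{E} \sup_{f} \frac{1}{T} \left( \sum_{t=1}^{T-1} \left[ \ell(Z_t', f) - \ell(Z_t, f) \right] + \epsilon_T \left( \ell(Z_T', f) - \ell(Z_T, f) \right) \right) \\
&\qquad \le \mathbb{E} \sup_{Z_T, Z_T'} \mathbb{E}_{\epsilon_T} \sup_{f} \frac{1}{T} \left( \sum_{t=1}^{T-1} \left[ \ell(Z_t', f) - \ell(Z_t, f) \right] + \epsilon_T \left( \ell(Z_T', f) - \ell(Z_T, f) \right) \right),
\end{aligned}$$

where we assumed the worst case over $Z_T, Z_T'$. The first expectation is now taken over the shorter sequence
$1, \ldots, T-1$. Repeating the process, we have that $\frac{1}{T} \mathcal{R}_T(\mathbf{p})$ is bounded by

$$\begin{aligned}
\frac{1}{T} \mathcal{R}_T(\mathbf{p}) &\le \sup_{Z_1^T, Z_1'^T} \mathbb{E}_{\epsilon_1^T} \sup_{f} \frac{1}{T} \sum_{t=1}^{T} \epsilon_t \left( \ell(Z_t', f) - \ell(Z_t, f) \right) \\
&\le 2 \sup_{Z_1^T} \mathbb{E}_{\epsilon_1^T} \sup_{f} \frac{1}{T} \sum_{t=1}^{T} \epsilon_t\, \ell(Z_t, f) \\
&= \frac{2}{\sqrt{T}} \sup_{Z_1^T} \mathrm{Rad}(\ell(\mathcal{F})).
\end{aligned}$$

□

^3 We thank Ambuj Tewari for pointing out a mistake in our original proof. We refer to [15] for a similar analysis.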


Properties of Rademacher averages are well-known. For instance, the Rademacher averages of a function class coincide with those of its convex hull. Furthermore, if $\ell$ is Lipschitz, the complexity of $\ell(\mathcal{F})$ can be upper bounded by the complexity of $\mathcal{F}$ multiplied by the Lipschitz constant. For example, we can immediately conclude that if the loss function is Lipschitz and the function class is a convex hull of a finite number $M$ of functions, the minimax value of the game is bounded by $R_T \le C\sqrt{T \log M}$ for some constant $C$. Similarly, a class with VC-dimension $d$ would have $\log M$ replaced by $d$. Theorem 18 is, therefore, giving us the flexibility to upper bound the minimax value of OCO for very general classes of functions.

Finally, we remark that most known upper bounds on Rademacher averages do not depend on the underlying distribution, as they hold for the worst-case empirical measure (see [13], p. 27). Thus, the supremum over the sequences might not be a hindrance to using known bounds for $\mathrm{Rad}(\ell(\mathcal{F}))$.
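As a rough numerical illustration of the Rademacher averages discussed above (a sketch added for intuition, not part of the original analysis), the following Python snippet estimates the quantity $\mathbb{E}_\epsilon \sup_f \frac{1}{\sqrt{T}}\sum_t \epsilon_t \ell(z_t, f)$ for a finite class by Monte Carlo. The loss matrix, the function names, and the sample sizes are hypothetical choices made only for this example. The comparison value $\sqrt{2\log M}$ is the standard finite-class (Massart) bound for losses bounded by 1 in absolute value under the $1/\sqrt{T}$ normalization.

```python
import math
import random


def rademacher_average(loss_matrix, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_eps max_f (1/sqrt(T)) * sum_t eps_t * loss_matrix[t][f].

    loss_matrix[t][f] plays the role of ell(z_t, f) for a fixed sequence
    z_1..z_T and a finite class of M functions (hypothetical example data).
    """
    rng = random.Random(seed)
    T = len(loss_matrix)
    M = len(loss_matrix[0])
    total = 0.0
    for _ in range(num_draws):
        eps = [rng.choice((-1, 1)) for _ in range(T)]
        # Supremum over the finite class of the signed sum of losses.
        best = max(
            sum(eps[t] * loss_matrix[t][f] for t in range(T)) for f in range(M)
        )
        total += best / math.sqrt(T)
    return total / num_draws


def example_loss_matrix(T=50, M=8, seed=1):
    """Arbitrary 0/1 losses, used only to have something to evaluate."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(M)] for _ in range(T)]


if __name__ == "__main__":
    losses = example_loss_matrix()
    estimate = rademacher_average(losses)
    bound = math.sqrt(2 * math.log(len(losses[0])))
    print(f"estimated Rad = {estimate:.3f}, finite-class bound = {bound:.3f}")
```

The estimate is for one fixed sequence; the bound in Theorem 18 involves the supremum over sequences, which this sketch does not attempt to compute.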

5.3 Linear Losses: Primal-Dual Ball Game 

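Let us examine the linear loss more closely. Of particular interest are linear games when $\mathcal{F} = B_{\|\cdot\|_*}$ is a ball in some norm $\|\cdot\|_*$ and $\mathcal{Z} = B_{\|\cdot\|}$, the two norms being dual. For this case, Theorem 18 gives an upper bound of

$$\frac{1}{T} R_T \le 2 \sup_{z_1^T} \mathbb{E}_{\epsilon_1^T} \sup_{f \in \mathcal{F}} f \cdot \left(\frac{1}{T}\sum_{t=1}^{T} \epsilon_t z_t\right) = 2 \sup_{z_1^T} \mathbb{E}_{\epsilon_1^T} \left\| \frac{1}{T}\sum_{t=1}^{T} \epsilon_t z_t \right\|. \tag{13}$$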



Fix $Z_1, \ldots, Z_{T-1}$ and observe that the expected norm is a convex function of $Z_T$. Hence, the supremum over $Z_T$ is achieved at the boundary of $\mathcal{Z}$. The same statement holds for all $Z_t$'s. Let $z_1^*, \ldots, z_T^*$ be the sequence achieving the supremum. Now take a distribution for round $t$ to be $p_t^*(z) = \frac{1}{2}\left(\mathbf{1}_{z_t^*}(z) + \mathbf{1}_{-z_t^*}(z)\right)$ and let $p^* = p_1^* \times \cdots \times p_T^*$ be the product distribution. It is easy to see that



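$$\begin{aligned}
\frac{1}{T} R_T &\le 2 \sup_{z_1^T} \mathbb{E}_{\epsilon_1^T} \left\| \frac{1}{T}\sum_{t=1}^{T} \epsilon_t z_t \right\|
= 2\, \mathbb{E}_{p^*} \mathbb{E}_{\epsilon_1^T} \left\| \frac{1}{T}\sum_{t=1}^{T} \epsilon_t Z_t \right\| \\
&= 2\, \mathbb{E}_{\epsilon_1^T} \left\| \frac{1}{T}\sum_{t=1}^{T} \epsilon_t z_t^* \right\|
= \frac{2}{\sqrt{T}}\, \mathbb{E}_{p^*} \mathrm{Rad}(\mathcal{F}).
\end{aligned} \tag{14}$$

Also note that $p_t^*$ has zero mean. It will be shown in Section 7.1 that the lower bound on $\frac{1}{T} R_T$ arising from this distribution is

$$\frac{1}{\sqrt{T}}\, \mathbb{E}_{p^*} \mathrm{Rad}(\mathcal{F}),$$

which is only a factor of 2 away. Thus, the adversary can play a product distribution that arises from the maximization in (13) and achieve regret at most a factor 2 from the optimum.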



6 $\Omega(\sqrt{T})$ bounds for non-differentiable $\Phi$

In this section, we develop lower bounds on the minimax value based on the geometric view-point 
described in Section WA\ 



Theorem 14 shows that the regret is upper bounded by $\log T$ for the case of strongly convex losses, and this upper bound is tight if the loss functions are quadratic, as we show later in the paper. Thus, flatness of $\Phi$ implies low regret. What about the converse? It turns out that if $\Phi$ is non-differentiable (has a point of infinite curvature), the regret is lower-bounded by $\sqrt{T}$, and this rate is achieved with the i.i.d. distribution $p^T$, where $p$ corresponds to a point of non-differentiability of $\Phi$.

The geometric viewpoint is fruitful here: vertices (points of non-differentiability) of $\Phi$ correspond to exposed faces in the loss class $S = \mathrm{co}[-\ell(\mathcal{F})]$ (see Figure 3), suggesting that the lower bounds of $\Omega(\sqrt{T})$ arise from having two distinct minimizers of expected error, a striking parallel to the analogous results for stochastic settings [12, 14].

To be more precise, vertices of $\sigma_S$ (and $\Phi$) translate into flat parts (non-singleton exposed faces) of $S$ (that is, of $\mathrm{co}[-\ell(\mathcal{F})]$) and the other way around. Corresponding to an exposed face is a supporting hyperplane. If $\ell$ is non-negative, then any exposed face facing the origin is supported by a hyperplane with positive co-ordinates (which can be normalized to get a distribution). So a non-singleton face exposed to the origin is equivalent to having at least two distinct minimizers $f$ and $g$ of $\mathbb{E}_p \ell(Z, \cdot)$ for some $p$, as discussed in Section 4.2. We demonstrate this scenario, along with the supporting hyperplane, $\langle \ell_f, p \rangle = \langle \ell_g, p \rangle$, in Figure 3.




Figure 3: The face of the convex hull of the loss class is supported by a probability distribution p. 



Suppose there is a non-singleton face $F_S(p)$ of $\mathrm{co}[-\ell(\mathcal{F})]$ supported by some distribution $p$. It is known (see [9]) that any exposed face is a convex set. The extreme points of this convex set are vectors $\ell_f$ (and not convex combinations such as $\frac{\ell_f + \ell_g}{2}$). Furthermore, for all $\ell_f, \ell_g \in F_S(p)$, we have $\langle \ell_f, p \rangle = \langle \ell_g, p \rangle$.

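Define the set of expected minimizers under $p$ as

$$\mathcal{F}^* := \left\{ f \in \mathcal{F} : \mathbb{E}\ell(Z, f) = \inf_{g \in \mathcal{F}} \mathbb{E}\ell(Z, g) \right\}.$$

Thus, $-\ell(\mathcal{F}^*) \subseteq F_S(p) \subset \mathrm{co}[-\ell(\mathcal{F})]$. The lower bound we are about to state arises from fluctuations of the empirical process over the set $\mathcal{F}^*$. To ease the presentation, we will refer to the sample average as $\hat{\mathbb{E}}\ell(Z, f)$.

Theorem 19. Suppose $F_S(p)$ is a non-singleton face of $\mathrm{co}[-\ell(\mathcal{F})]$, supported by $p$ (i.e. $|\mathcal{F}^*| > 1$). Fix any $f^* \in \mathcal{F}^*$ and let $Q \subseteq \ell(\mathcal{F}^*)$ be any subset containing $\ell(\cdot, f^*)$. Define $\bar{Q} = \{g - \ell(\cdot, f^*) : g \in Q\}$, the shifted loss class. Then for $T > T_0(\mathcal{F})$,

$$\frac{1}{T} R_T \ge \mathbb{E} \sup_{f \in \mathcal{F}^*} \left[ \mathbb{E}_p \ell(Z, f) - \hat{\mathbb{E}}\ell(Z, f) \right] \ge \frac{c}{\sqrt{T}} \sup_{Q \subseteq \ell(\mathcal{F}^*)} \mathbb{E} \sup_{q \in \bar{Q}} G_q,$$

where $G_q$ is the Gaussian process indexed by the (centered) functions in $\bar{Q}$, and $c$ is some absolute constant.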
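Proof. Recalling that $\mathbb{E}\ell(Z, f) = \inf_{g \in \mathcal{F}} \mathbb{E}\ell(Z, g) = \Phi(p)$ for all $f \in \mathcal{F}^*$, we have

$$\begin{aligned}
\frac{1}{T} R_T \ge \frac{1}{T} R_T(p^T) &= \Phi(p) - \mathbb{E}_p \Phi(\hat{P}_T) \\
&= \inf_{f \in \mathcal{F}} \mathbb{E}\ell(Z, f) - \mathbb{E} \inf_{f \in \mathcal{F}} \hat{\mathbb{E}}\ell(Z, f) \\
&\ge \mathbb{E} \sup_{f \in \mathcal{F}^*} \left[ \mathbb{E}_p \ell(Z, f) - \hat{\mathbb{E}}\ell(Z, f) \right].
\end{aligned}$$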

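Now, fix any $f^* \in \mathcal{F}^*$. The proof of Theorem 2.2 in [11] reveals that empirical fluctuations are lower bounded by the supremum of the Gaussian process indexed by $\bar{Q}$. To be precise, there exists $T_0(\mathcal{F})$ such that for $T > T_0(\mathcal{F})$, with probability greater than $c_1$,

$$\inf_{f \in \mathcal{F}^*} \hat{\mathbb{E}}\left(\ell(Z, f) - \ell(Z, f^*)\right) \le -c_2\, \frac{\mathbb{E} \sup_{q \in \bar{Q}} G_q}{\sqrt{T}}$$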

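for some absolute constants $c_1, c_2$. Rearranging and using the fact that $\mathbb{E}\ell(Z, f) - \mathbb{E}\ell(Z, f^*) = 0$ for $f \in \mathcal{F}^*$,

$$\sup_{f \in \mathcal{F}^*} \left[ \mathbb{E}\ell(Z, f) - \hat{\mathbb{E}}\ell(Z, f) + \hat{\mathbb{E}}\ell(Z, f^*) - \mathbb{E}\ell(Z, f^*) \right] \ge c_2\, \frac{\mathbb{E} \sup_{q \in \bar{Q}} G_q}{\sqrt{T}}$$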

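with probability at least $c_1$. The supremum is non-negative because $\ell(\cdot, f^*) \in Q$ and therefore

$$\mathbb{E} \sup_{f \in \mathcal{F}^*} \left[ \mathbb{E}\ell(Z, f) - \hat{\mathbb{E}}\ell(Z, f) \right] \ge c_1 c_2\, \frac{\mathbb{E} \sup_{q \in \bar{Q}} G_q}{\sqrt{T}}.$$

□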



We remark that in the experts case, the lower bound on regret becomes $\sqrt{T \log N}$, as the Gaussian process reduces to $N$ independent Gaussian random variables. We discuss this and other examples in the next section.



7 Lower Bounds for Special Cases 

We now provide lower bounds for particular games. Some of the results of the section are known: we show how the proofs follow from the general lower bounds developed in the previous section.

7.1 Linear Loss: Primal-Dual Ball Game

Here, we develop lower bounds for the case considered in Section 5.3. As before, to prove a lower bound it is enough to take an i.i.d. or product distribution. In particular, the product distribution described after Eq. (13) is of particular interest. To this end, choose $p = p_1 \times \cdots \times p_T$ to be a product of symmetric distributions on the surface of the primal ball $\mathcal{Z}$ with $\mathbb{E}_{p_t} Z = 0$. We conclude that $\Phi(p_t) = 0$ and

$$\frac{1}{T} R_T \ge -\mathbb{E} \inf_{f \in \mathcal{F}} f \cdot \left( \frac{1}{T}\sum_{t=1}^{T} Z_t \right) = \mathbb{E} \left\| \frac{1}{T}\sum_{t=1}^{T} Z_t \right\|$$

by the definition of dual norm. Now, because of symmetry,

$$\mathbb{E} \left\| \frac{1}{T}\sum_{t=1}^{T} Z_t \right\| = \frac{1}{T}\, \mathbb{E}_{\epsilon_1^T} \mathbb{E} \left\| \sum_{t=1}^{T} \epsilon_t Z_t \right\| = \frac{1}{T}\, \mathbb{E}\, \mathbb{E}_{\epsilon_1^T} \left\| \sum_{t=1}^{T} \epsilon_t Z_t \right\|. \tag{15}$$

We conclude that $R_T \ge \sqrt{T}\, \mathbb{E}\, \mathrm{Rad}(\mathcal{F})$, the expected Rademacher averages of the dual ball acting on the primal ball. This is within a factor of 2 of the upper bound (14) of Section 5.3. Hence, for the linear game on primal-dual ball, a product distribution is within a factor 2 from the optimum. Note that this is not true for curved losses.

Now, consider the particular case of $\mathcal{F} = \mathcal{Z} = B_2$, the Euclidean ball. We will consider the following distributions.

• Suppose $p$ is such that $p_t(\cdot | Z_1^{t-1})$ puts mass on the intersection of $B_2$ and the subspace perpendicular to $\sum_{s=1}^{t-1} Z_s$, and $\mathbb{E}[Z_t | Z_1^{t-1}] = 0$. Then $\mathbb{E}\left\| \sum_{t=1}^{T} Z_t \right\| = \sqrt{T}$ by unraveling the sum from the end. In fact, this is shown to be the optimal value for this problem in [1]. We conclude that a non-product distribution achieves the optimal regret for this problem.

• Consider any symmetric i.i.d. distribution on the surface of the ball $\mathcal{Z}$. Note that for this case we still have the lower bound of Eq. (15). The Khinchine-Kahane inequality then implies $R_T \ge \sqrt{T/2}$, and the constant $\sqrt{2}$ is optimal (see [10]) in the absence of further assumptions.

• Consider another example of an i.i.d. distribution that puts equal mass on two points $\pm z_0$ on each round, with $\|z_0\| = 1$. It then follows that this i.i.d. distribution achieves the regret equal to the length of the random walk $\mathbb{E}\left| \sum_{t=1}^{T} \epsilon_t \right|$, which is known to be asymptotically $\sqrt{2T/\pi}$.

• We expect that putting a uniform distribution on the surface of the ball will give a regret close to the optimal $\sqrt{T}$ as the number of dimensions grows, since $Z_t$ is likely to be orthogonal to the sum of previous choices, as in the first (dependent) example.

We conclude that for the Euclidean game, the best strategy of the adversary is a sequence of dependent distributions, while product and i.i.d. distributions come within a multiplicative constant close to 1 from it.
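The two-point example can be checked numerically. The sketch below is an illustration added here, with a hypothetical function name: it computes the expected length $\mathbb{E}\left|\sum_t \epsilon_t\right|$ of the $\pm 1$ random walk exactly from the binomial distribution, and prints it next to the Khinchine-Kahane lower bound $\sqrt{T/2}$, the asymptotic value $\sqrt{2T/\pi}$, and the optimal value $\sqrt{T}$.

```python
import math


def expected_walk_length(T):
    """Exact E|eps_1 + ... + eps_T| for independent uniform +/-1 signs."""
    total = 0
    for k in range(T + 1):
        # k signs equal +1 and T - k equal -1, so the sum is 2k - T.
        total += math.comb(T, k) * abs(2 * k - T)
    return total / 2 ** T


if __name__ == "__main__":
    for T in (10, 100, 1000):
        exact = expected_walk_length(T)
        print(
            f"T={T}: E|walk|={exact:.4f}, sqrt(T/2)={math.sqrt(T / 2):.4f}, "
            f"sqrt(2T/pi)={math.sqrt(2 * T / math.pi):.4f}, sqrt(T)={math.sqrt(T):.4f}"
        )
```

The computation uses exact integer arithmetic for the binomial sum and converts to a float only in the final division.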



7.2 Experts Setting 

The experts setting provides some of the easiest examples for linear games. We start with a simplified game, where $\mathcal{F} = \mathcal{Z} = \Delta_N$, the $N$-simplex. The $\Phi$ function for this case is easy to visualize. We then present the usual game, where the set $\mathcal{Z} = [0, 1]^N$. In both cases, we are interested in lower-bounding regret.



7.2.1 The simplified game 

Let us look at the game when only one expert can suffer a loss of 1 per round, i.e. the space of actions $\mathcal{Z}$ contains $N$ elements $e_1, \ldots, e_N$. The probability over these choices of the adversary is an $N$-dimensional simplex, just as the space of functions $\mathcal{F}$. For any $p \in \Delta_N$,

$$\Phi(p) = \min_{f \in \Delta_N} \mathbb{E}_p Z \cdot f = \min_{f \in \Delta_N} p \cdot f = \min_i p_i,$$

and therefore $\Phi$ has the shape of a pyramid with its maximum at $p^* = \frac{1}{N}\mathbf{1}$ and $\Phi(p^*) = 1/N$. The regret is lower-bounded by an i.i.d. game with this distribution at each round, i.e.

$$\frac{1}{T} R_T \ge \Phi(p^*) - \mathbb{E}\Phi(\hat{P}_T) = \frac{1}{N} - \mathbb{E} \min_{f \in \Delta_N} \left( \frac{1}{T}\sum_{t=1}^{T} Z_t \right) \cdot f = \mathbb{E} \max_{i \in [N]} \left[ \frac{1}{N} - \frac{n_i}{T} \right],$$

where $n_i$ is the number of times $e_i$ has been chosen out of $T$ rounds. This is the expected maximum deviation from the mean of a multinomial distribution, i.e. $1/N$ minus the smallest proportion of balls in any bin after $T$ balls have been distributed uniformly at random.
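The balls-in-bins description lends itself to a quick simulation. The following sketch is an illustration only; the function name, the number of Monte Carlo draws, and the values of $N$ and $T$ are arbitrary assumptions. It estimates the expected value of $1/N$ minus the smallest bin proportion, and also prints the estimate rescaled by $\sqrt{T}$ so that the decay with $T$ can be eyeballed.

```python
import math
import random


def expected_max_deviation(N, T, num_draws=2000, seed=0):
    """Monte Carlo estimate of E max_i [1/N - n_i/T] for T uniform throws into N bins."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_draws):
        counts = [0] * N
        for _ in range(T):
            counts[rng.randrange(N)] += 1
        # 1/N minus the smallest proportion of balls in any bin.
        total += 1.0 / N - min(counts) / T
    return total / num_draws


if __name__ == "__main__":
    N = 10
    for T in (100, 400, 1600):
        dev = expected_max_deviation(N, T)
        print(
            f"N={N}, T={T}: estimated deviation = {dev:.4f}, "
            f"sqrt(T) * deviation = {math.sqrt(T) * dev:.3f}"
        )
```

Each simulated deviation is non-negative, since the smallest bin proportion can never exceed the average proportion $1/N$.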

To obtain the lower bound on the maximum deviation, let us turn to Section 6. The convex hull of the (negative) loss class $\mathrm{co}[-\ell(\mathcal{F})]$ is the simplex itself. This is also the face supported by the uniform distribution $p^*$. The lower bound of Theorem 19 involves the Gaussian process indexed by a set $Q$. Let us take $f^* = \frac{1}{N}\mathbf{1}$ and $\mathcal{F}^* = \{e_1, \ldots, e_N\} \cup \{f^*\}$. We can verify that $\mathbb{E}\, e_i^\top Z = \Phi(p^*) = \frac{1}{N}$, the covariance of the process indexed by $Q = \ell(\mathcal{F}^*)$ is $\mathbb{E}(e_i^\top Z - \frac{1}{N})(e_j^\top Z - \frac{1}{N}) = -\frac{1}{N^2}$ for $i \ne j$ and the variance is $\mathbb{E}(e_i^\top Z - \frac{1}{N})^2 = \frac{1}{N}\left(1 - \frac{1}{N}\right)$. Let $\{Y_i\}$ be the Gaussian random variables with the aforementioned covariance structure. Then $\|Y_i - Y_j\|^2 = \mathbb{E}(Y_i - Y_j)^2 = \frac{2}{N}$. We can now construct independent Gaussian random variables $\{X_i\}$ with the same distance by putting $\frac{1}{N}$ on the diagonal of the covariance matrix. By Slepian's Lemma, $\mathbb{E} \sup_i X_i \le \mathbb{E} \sup_i Y_i$, thus giving us the lower bound



^ TlogN 

for this problem, for some absolute constant c and T large enough. 
7.2.2 The general case 

In the more general game, any expert can suffer a 0/1 loss. Thus, $p$ is a distribution on $2^N$ losses $Z$. To lower bound the regret, choose a uniform distribution on $2^N$ binary vectors as the i.i.d. choice for the adversary. We have $\Phi(p) = \min_{f \in \Delta_N} f \cdot \mathbb{E}Z = 1/2$. As for the other term, $\mathbb{E}\Phi(\hat{P}_T) = \mathbb{E} \min_{f \in \Delta_N} f \cdot \left( \frac{1}{T}\sum_{t=1}^{T} Z_t \right)$. Thus, the regret is

$$\frac{2}{T} R_T \ge \mathbb{E} \max_{i \in [N]} \frac{1}{T}\sum_{t=1}^{T} \epsilon_{i,t},$$

where $\epsilon_{i,t}$ are Rademacher $\{\pm 1\}$-valued random variables. It is easy to show that the expected maximum is lower bounded by $c\sqrt{\log N / T}$. This coincides with a result in [5], which shows that the asymptotic behavior is $\sqrt{\log N / (2T)}$.

7.3 Quadratic Loss 

We consider the quadratic loss, $\ell(z, f) = \|f - z\|^2$. This loss function is 1-strongly convex, and therefore we already have the $O(\log T)$ bound of Corollary 15. In this section, we present an almost matching lower bound using a particular adversarial strategy. The problem of quadratic loss was previously addressed in [16]: we reprove their lower bound in our framework, borrowing a number of tricks from that work.

Following Section 6, it is tempting to use an i.i.d. distribution and compute the regret explicitly. Unfortunately, this only leads to a constant lower bound, whereas we would hope to match the upper bound of $\log T$. We can show this easily: let $p^T$ be some i.i.d. distribution, then

$$T\, \mathbb{E}\Phi(\hat{P}_T) = (T-1)\, \mathbb{E}\|Z_1\|^2 - \frac{T(T-1)}{T}\, \mathbb{E}\langle Z_1, Z_2 \rangle = (T-1)\left( \mathbb{E}\|Z\|^2 - \|\mathbb{E}Z\|^2 \right) = (T-1)\, \mathrm{var}(Z) = (T-1)\, \Phi(p).$$

Thus

$$R_T(p^T) = T\Phi(p) - T\, \mathbb{E}\Phi(\hat{P}_T) = T\Phi(p) - (T-1)\Phi(p) = \Phi(p),$$

where we see that the last term is independent of $T$.

Indeed, obtaining $\log T$ regret requires that we look further than i.i.d. To this end, define

$$c_T := \frac{1}{T}, \qquad c_{t-1} := c_t + c_t^2 \quad \text{for all } t = T, T-1, \ldots$$



We construct our distribution $p$ using this sequence as follows. Assume $\mathcal{Z} = \mathcal{F} = [-1, 1]$ and for convenience let $Z_{1:s} := \sum_{t=1}^{s} Z_t$. Also, for this section, we use a shorthand for the conditional expectation, $\mathbb{E}_t[\cdot] := \mathbb{E}[\cdot | Z_1, \ldots, Z_{t-1}]$. Each conditional distribution is chosen as

$$p_t(Z_t = z | Z_1, \ldots, Z_{t-1}) = \begin{cases} \dfrac{1 + c_t Z_{1:t-1}}{2} & \text{for } z = 1 \\[2mm] \dfrac{1 - c_t Z_{1:t-1}}{2} & \text{for } z = -1. \end{cases}$$



Notice that this choice ensures that $\mathbb{E}_t Z_t = c_t Z_{1:t-1}$, i.e. the conditional expectation is identical to the observed sample mean scaled by some shrinkage factor $c_t$. That $\frac{1 + c_t Z_{1:t-1}}{2} \in [0, 1]$ follows from the statement $c_t \le \frac{1}{t}$, which is proven by an easy induction. We now recall a result shown in [16]:

Lemma 20 (from [16]).

$$\sum_{t=1}^{T} c_t = \log T - \log\log T + o(1).$$

This crucial lemma leads directly to the main result of this section.

Theorem 21. With $p$ defined above, $R_T(p) = \sum_{t=1}^{T} c_t$, and therefore

$$R_T(p) = \log T - \log\log T + o(1).$$

Proof. For all $t = 0, 1, \ldots, T$, let

$$Q_t := \mathbb{E}\left[ \sum_{s=1}^{t} (Z_s - \mathbb{E}_s Z_s)^2 + c_t Z_{1:t}^2 - \sum_{s=1}^{t} Z_s^2 + (c_{t+1} + c_{t+2} + \cdots + c_T) \right].$$

We will show by a backwards induction that $Q_t = R_T(p)$, from which the result will follow since $\sum_{t=1}^{T} c_t = Q_0$.

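We begin with the base case, $Q_T = R_T(p)$. Recall, $\min_f \mathbb{E}(f - Z)^2 = \mathbb{E}(Z - \mathbb{E}Z)^2$. At the same time, $\min_f \sum_{t=1}^{T} (f - Z_t)^2 = \sum_{t=1}^{T} Z_t^2 - \frac{1}{T} Z_{1:T}^2$. This implies that

$$R_T(p) = \mathbb{E}\left[ \sum_{t=1}^{T} \mathbb{E}_t (Z_t - \mathbb{E}_t Z_t)^2 - \sum_{t=1}^{T} Z_t^2 + \frac{1}{T} Z_{1:T}^2 \right] = Q_T,$$

noting that the conditional expectations $\mathbb{E}_t[\cdot]$ are unnecessary within the full $\mathbb{E}[\cdot]$.

We now show that $Q_t = Q_{t-1}$. To begin, we will need to compute the following conditional expectation:

$$\mathbb{E}_t Z_{1:t}^2 = \frac{1 + c_t Z_{1:t-1}}{2} (Z_{1:t-1} + 1)^2 + \frac{1 - c_t Z_{1:t-1}}{2} (Z_{1:t-1} - 1)^2 = Z_{1:t-1}^2 (1 + 2c_t) + 1.$$

Notice that we may write $Q_t - Q_{t-1}$ with the expression

$$\begin{aligned}
&\mathbb{E}\left[ c_t Z_{1:t}^2 - c_{t-1} Z_{1:t-1}^2 + (Z_t - \mathbb{E}_t Z_t)^2 - Z_t^2 - c_t \right] \\
&\quad = \mathbb{E}\left[ \mathbb{E}_t\left\{ c_t Z_{1:t}^2 + (Z_t - \mathbb{E}_t Z_t)^2 - Z_t^2 \right\} - c_{t-1} Z_{1:t-1}^2 - c_t \right] \\
&\quad = \mathbb{E}\left[ \left( c_t \left( Z_{1:t-1}^2 (1 + 2c_t) + 1 \right) - (\mathbb{E}_t Z_t)^2 \right) - c_{t-1} Z_{1:t-1}^2 - c_t \right] \\
&\quad = \mathbb{E}\left[ c_t Z_{1:t-1}^2 + 2c_t^2 Z_{1:t-1}^2 - (c_t Z_{1:t-1})^2 - c_{t-1} Z_{1:t-1}^2 \right] \\
&\quad = \mathbb{E}\left[ (c_t + c_t^2) Z_{1:t-1}^2 - c_{t-1} Z_{1:t-1}^2 \right] \\
&\quad = \mathbb{E}\left[ c_{t-1} Z_{1:t-1}^2 - c_{t-1} Z_{1:t-1}^2 \right] = 0.
\end{aligned}$$

Hence, $Q_t = Q_{t-1}$ and we are done. □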
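The identity $R_T(p) = \sum_t c_t$ and the growth rate in Lemma 20 can be sanity-checked numerically. The sketch below is an illustration added here rather than part of the original proof, and its function names are hypothetical. It assumes the recursion $c_T = 1/T$, $c_{t-1} = c_t + c_t^2$ used in the induction step, and evaluates the regret of the adversary's distribution through the second moments $m_t = \mathbb{E} Z_{1:t}^2$, which satisfy $m_t = (1 + 2c_t)\, m_{t-1} + 1$ because $Z_t \in \{-1, +1\}$ and $\mathbb{E}_t Z_t = c_t Z_{1:t-1}$.

```python
import math


def shrinkage_sequence(T):
    """Return [c_1, ..., c_T] with c_T = 1/T and c_{t-1} = c_t + c_t**2."""
    c = [0.0] * (T + 1)  # 1-based indexing; c[0] is unused
    c[T] = 1.0 / T
    for t in range(T, 1, -1):
        c[t - 1] = c[t] + c[t] ** 2
    return c[1:]


def adversary_regret(T):
    """Regret E[sum_t Var_t(Z_t)] - E[min_f sum_t (f - Z_t)^2] under the construction."""
    c = shrinkage_sequence(T)
    second_moment = 0.0  # m_0 = E[Z_{1:0}^2] = 0
    squared_means = 0.0  # running sum of E[(E_t Z_t)^2] = c_t^2 * m_{t-1}
    for t in range(1, T + 1):
        ct = c[t - 1]
        squared_means += ct ** 2 * second_moment
        second_moment = (1.0 + 2.0 * ct) * second_moment + 1.0
    # sum_t E[1 - (E_t Z_t)^2] - (T - m_T / T) simplifies to the line below.
    return second_moment / T - squared_means


if __name__ == "__main__":
    for T in (10, 1000, 100000):
        total = sum(shrinkage_sequence(T))
        reference = math.log(T) - math.log(math.log(T))
        print(
            f"T={T}: sum c_t = {total:.4f}, regret = {adversary_regret(T):.4f}, "
            f"log T - log log T = {reference:.4f}"
        )
```

The two computed quantities agree up to floating-point error, which mirrors the backwards induction $Q_t = Q_{t-1}$; the comparison with $\log T - \log\log T$ is only indicative, since Lemma 20 is an asymptotic statement.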

8 A Few More Results on $\Phi$

The following result is in [9], Theorem 3.3.1.

Theorem 22. Let $S_1$ and $S_2$ be nonempty closed convex sets and let $\sigma_1$, $\sigma_2$ be their respective support functions. Then

$$S_1 \subset S_2 \iff \sigma_1(x) \le \sigma_2(x) \ \text{ for all } x \in \mathbb{R}^n.$$

Hence, taking a subset of $\mathrm{co}[-\ell(\mathcal{F})]$ leads to a lower bound on the support function and, hence, on $\Phi$. We have an obvious corollary:

Corollary 23. Suppose $S_1 \subset S \subset S_2$, where $S = \mathrm{co}[-\ell(\mathcal{F})]$ and all the sets are nonempty, closed and convex. Then

$$\Phi_1 \le \Phi \le \Phi_2,$$

where ^^(p) min_^^gSi 

We remark that the argmin of $\Phi_i$ can now be a loss vector of a function not found in $\mathcal{F}$. However, this possibility can be eliminated if the $S_i$ are constructed as convex hulls of a subset of $-\ell(\mathcal{F})$.
Let us now consider linear transformations of the loss class.

Proposition 24 (Proposition 3.3.3 in [9]). Let $A : \mathbb{R}^n \to \mathbb{R}^m$ be a linear operator, with adjoint $A^*$ (for some scalar product $\langle\langle \cdot, \cdot \rangle\rangle$ in $\mathbb{R}^m$). For $S \subset \mathbb{R}^n$ nonempty, we have

$$\sigma_{\mathrm{cl}\, A(S)}(y) = \sigma_S(A^* y) \quad \text{for all } y \in \mathbb{R}^m.$$

We can now study linear transformations of the set $\mathrm{co}[-\ell(\mathcal{F})]$. Suppose $A$ is a linear invertible transformation and assume the set $S$ contains a non-singleton exposed face, implying a $\sqrt{T}$ rate. If the transformation $A$ is such that the set $A(S)$ still contains the exposed face (i.e. does not rotate it away from the origin), then the minimax regret over the modified $\Phi$ is also $\sqrt{T}$. Moreover, it should differ from the original regret by a factor depending on a property of $A$, such as the condition number. We can use the idea of transformations $A$ to define isomorphic learning problems, i.e. those which can be obtained by some invertible mapping of the loss class.






References 

[1] J. Abernethy, P. L. Bartlett, A. Rakhlin, and A. Tewari. Optimal strategies and minimax lower bounds for online convex games. In COLT, 2008.

[2] P. L. Bartlett, E. Hazan, and A. Rakhlin. Adaptive online gradient descent. In NIPS, 2007.

[3] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006.

[4] Jonathan M. Borwein and Adrian S. Lewis. Convex Analysis and Nonlinear Optimization. Advanced Books in Mathematics. Canadian Mathematical Society, to appear.

[5] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[6] Victor de la Peña and Evarist Giné. Decoupling: From Dependence to Independence. Springer, 1998.

[7] B. A. Frigyik, S. Srivastava, and M. R. Gupta. Functional Bregman divergence and Bayesian estimation of distributions. CoRR, abs/cs/0611123, 2006.

[8] Elad Hazan, Adam Kalai, Satyen Kale, and Amit Agarwal. Logarithmic regret algorithms for online convex optimization. In COLT, pages 499-513, 2006.

[9] Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of Convex Analysis. Springer, 2001.

[10] R. Latała and K. Oleszkiewicz. On the best constant in the Khinchin-Kahane inequality. Studia Math., 109(1):101-104, 1994.

[11] G. Lecué and S. Mendelson. Sharper lower bounds on the performance of the empirical risk minimization algorithm. Available at http://www.cmi.univ-mrs.fr/~lecue/LM2.pdf.

[12] W. S. Lee, P. L. Bartlett, and R. C. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5):1974-1980, 1998.

[13] S. Mendelson. A few notes on statistical learning theory. In S. Mendelson and A. J. Smola, editors, Advanced Lectures in Machine Learning, LNCS 2600, Machine Learning Summer School 2002, Canberra, Australia, February 11-22, pages 1-40. Springer, 2003.

[14] S. Mendelson. Lower bounds for the empirical minimization algorithm. IEEE Transactions on Information Theory, 2008. To appear.

[15] K. Sridharan and A. Tewari. Convex games in Banach spaces (working title), 2009. Unpublished.

[16] E. Takimoto and M. Warmuth. The minimax strategy for Gaussian density estimation. In COLT, pages 100-106. Morgan Kaufmann, San Francisco, 2000.

[17] V. Vovk. Competitive on-line linear regression. In NIPS '97: Proceedings of the 1997 conference on Advances in neural information processing systems 10, pages 364-370, Cambridge, MA, USA, 1998. MIT Press.

[18] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928-936, 2003.






Appendix 



Proof of Theorem 1 for general $T$. Consider the last optimization choice $z_T$ in Eq. (1). Suppose we instead draw $Z_T$ according to a distribution, and compute the expected value of the quantity in the parentheses in Eq. (1). Then it is clear that maximizing this expected value over all distributions on $\mathcal{Z}$ is equivalent to maximizing over $z_T$, with the optimizing distribution concentrated on the optimal point. Hence,



?rp = inf sup • • • inf sup inf sup E Zt~pt 



■ T T ■ 

^^(Zi,/t)- inf ^^(zt,/) 

t=i ' t=i 



(16) 



In the last expression, it is understood that sums are over the sequence $\{z_1, \ldots, z_{T-1}, Z_T\}$. The first $T-1$ elements are quantified in the suprema, while the last $Z_T$ is a random variable. Let us adopt the following notation for the conditional expectation: $\mathbb{E}_t[X] = \mathbb{E}_{Z_t \sim p_t}[X | Z_1^{t-1}]$.
We now apply Proposition 2 to the last inf/sup pair in (16) with

T T 



M(/t,pt) =E: 



inf 

t=i t=i 



which is convex in $f_T$ (by assumption) and linear in $p_T$. Moreover, the set $\mathcal{F}$ is compact, and both $\mathcal{F}$ and the set of distributions on $\mathcal{Z}$ are convex. We conclude that



— inf sup 



inf 



sup sup inf E T 



^£(zt,/i)- m^^£(z,,/) 



,t=i 



inf sup 



= inf sup 



CT-i T \ 

y2£{ztjt)+ inf Et[^(^t,/t)]-Et inf V^(zt,/) 
t~i -^^^"^ ^'^''^1=1 / 



inf 



sup 



y2e{zt,ft)+ sup { M ET[i{ZTjT)]~ETmf Te{zt,f) 



(17) 



= inf sup • • • inf sup \y^£{zt,ft)+ 

fi<^^ ziez fT-2(^:F zj._2ez 

inf sup £{zt-iJt-i)+ sup (inf Et [^(^t, /t)] - Et inf V £(zt, /) I ), 
fr-ieF ZT-iez \_ pTi^S^ yfT(^^ ^^•'^t=L J J/ 

As we swap inf/sup from inside out, the $z_t$'s are taken to be random variables and denoted by $Z_t$. Below, we replace the $Z_t$'s in the infima over $f_t$ of conditional expectations by a dummy variable $Z$.

It is important to note that the maximizing distribution $p_T$ depends on the previous choices $Z_1^{T-1}$, but not on any of the $f_t$'s. As before, we can replace the supremum over $z_{T-1}$ by a supremum over distributions $p_{T-1}$. Noting that the expression inside of square brackets is convex in $f_{T-1}$ and linear in a distribution $p_{T-1}$ on $Z_{T-1}$, we invoke Proposition 2 again to obtain

(T-2 
e{zt,ft)+ 



sup inf 



£{Zt-i, fr-i) + sup <^ inf E t [^(^, /t)] - Et inf V ^(zt, /) 



i'T-2 



= inf sup ••• inf sup > £{zt,ft)- 
h<^^ ziez fT-2eJ^ ZT-2ez \~! 



sup 



inf ET^^[i{ZjT^,)])+ET-i sup I inf E ^ [^(^, /t)] - E t inf V ^(z*, /) | ). 
Again, it is understood that in the last term, the sum is ranging over $\{z_1, \ldots, z_{T-2}, Z_{T-1}, Z_T\}$. Since the term in round brackets (involving $\inf_{f_{T-1}}$) does not depend on $p_T$ or $Z_{T-1}$, we can pull it inside the supremum:



/T-2 



= inf sup • • • inf sup \ "S^ ({zt, ft)- 



, t=i 



sup Et-1 sup { inf ET-i[£{Z,fT-i)]+ inf E t m Jt)] - E t mf e{zt J)\] . 

We now argue that choosing a distribution $p_{T-1}$, averaging over the $Z_{T-1}$ under the maximizing distribution, and then maximizing over the conditional $p_T(\cdot | Z_{T-1})$ is the same as maximizing over the joint distributions $p_{T-1,T}$ over $(Z_{T-1}, Z_T)$ and then averaging. Hence,



/T-2 



^T= inf sup--- inf sup £{zt,ft) + 

sup Et-A inf Et-i[£(^,/t-i)]+ inf E t [^(^, /t)] - Et inf V ^(zt, /) | ) . 

Pt-i.t yfr-i^^ /rS^ -^^^ J / 

Comparing this to (17), we observe that the process can be repeated for the inf/sup pair at time $(T-2)$ and so on.

□ 

Proof of Lemma^ Concavity of $ is easy to establish. For any distributions p and q, we see that 
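$$\Phi\left( \frac{p + q}{2} \right) = \inf_f \mathbb{E}_{\frac{p+q}{2}} \ell(Z, f) \ge \frac{1}{2} \inf_f \mathbb{E}_p \ell(Z, f) + \frac{1}{2} \inf_f \mathbb{E}_q \ell(Z, f) = \frac{1}{2}\left( \Phi(p) + \Phi(q) \right).$$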

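As for concavity of $R_T$, let $p_\alpha := \alpha p + (1 - \alpha) q$ and note the simple calculation of the conditional probability

$$dp_\alpha(Z_t | Z_1^{t-1}) = \frac{\alpha\, dp(Z_1^{t-1})\, dp(Z_t | Z_1^{t-1}) + (1 - \alpha)\, dq(Z_1^{t-1})\, dq(Z_t | Z_1^{t-1})}{\alpha\, dp(Z_1^{t-1}) + (1 - \alpha)\, dq(Z_1^{t-1})}.$$

We can now show that

$$\begin{aligned}
\mathbb{E}_{p_\alpha} \sum_{t=1}^{T} \Phi\left( p_\alpha(\cdot | Z_1^{t-1}) \right)
&= \sum_{t=1}^{T} \int \Phi\left( p_\alpha(\cdot | Z_1^{t-1}) \right) \left( \alpha\, dp(Z_1^{t-1}) + (1 - \alpha)\, dq(Z_1^{t-1}) \right) \\
&\ge \sum_{t=1}^{T} \int \frac{\alpha\, dp(Z_1^{t-1})\, \Phi\left( p(\cdot | Z_1^{t-1}) \right) + (1 - \alpha)\, dq(Z_1^{t-1})\, \Phi\left( q(\cdot | Z_1^{t-1}) \right)}{\alpha\, dp(Z_1^{t-1}) + (1 - \alpha)\, dq(Z_1^{t-1})} \\
&\qquad\qquad \times \left( \alpha\, dp(Z_1^{t-1}) + (1 - \alpha)\, dq(Z_1^{t-1}) \right) \\
&= \sum_{t=1}^{T} \alpha\, \mathbb{E}_p \Phi\left( p(\cdot | Z_1^{t-1}) \right) + (1 - \alpha)\, \mathbb{E}_q \Phi\left( q(\cdot | Z_1^{t-1}) \right).
\end{aligned}$$

Thus, the first term in the regret is concave with respect to the joint distribution. In addition the second term is clearly linear, since

$$-\mathbb{E}_{p_\alpha} \Phi(\hat{P}_T) = -\alpha\, \mathbb{E}_p \Phi(\hat{P}_T) - (1 - \alpha)\, \mathbb{E}_q \Phi(\hat{P}_T).$$

Since a linear plus a concave function is still concave, $R_T(\cdot)$ is concave. □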
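Proof of Lemma 13. Since $\ell$ is $\sigma$-strongly convex, we have (by taking $f = f_p$, $g = f_q$ in the definition of strong convexity)

$$\frac{\ell(z, f_p) + \ell(z, f_q)}{2} \ge \ell\left( z, \frac{f_p + f_q}{2} \right) + \frac{\sigma}{8} \|f_p - f_q\|^2$$

for any $z$. Taking expectations with respect to $z \sim p$ and noting that $f_p$ minimizes $\mathbb{E}_p \ell(z, f)$, we have

$$\mathbb{E}_p \frac{\ell(z, f_p) + \ell(z, f_q)}{2} \ge \mathbb{E}_p \ell\left( z, \frac{f_p + f_q}{2} \right) + \frac{\sigma}{8} \|f_p - f_q\|^2 \ge \mathbb{E}_p \ell(z, f_p) + \frac{\sigma}{8} \|f_p - f_q\|^2.$$

Rearranging terms,

$$\frac{\sigma}{4} \|f_p - f_q\|^2 \le \mathbb{E}_p \ell(z, f_q) - \mathbb{E}_p \ell(z, f_p).$$

Similarly,

$$\frac{\sigma}{4} \|f_p - f_q\|^2 \le \mathbb{E}_q \ell(z, f_p) - \mathbb{E}_q \ell(z, f_q).$$

Adding,

$$\frac{\sigma}{2} \|f_p - f_q\|^2 \le \int \left[ \ell(z, f_q) - \ell(z, f_p) \right] \left( dp(z) - dq(z) \right).$$

Using the Lipschitz condition,

$$\frac{\sigma}{2} \|f_p - f_q\|^2 \le \int \left| \ell(z, f_q) - \ell(z, f_p) \right| \cdot \left| dp(z) - dq(z) \right| \le L \|f_p - f_q\| \cdot \|p - q\|_1.$$

Thus,

$$\|f_p - f_q\| \le \frac{2L}{\sigma} \|p - q\|_1,$$

which establishes the main building block resulting from the curvature.

□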



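Lemma 25. If $\ell$ satisfies the conditions of Theorem 12 then $\ell(\cdot, f_p)$ is a subdifferential of $\Phi$ at $p$.

Proof. We claim that $\ell(\cdot, f_p)$ is the differential of $\Phi$ at the point $p$ and, therefore, $\int \ell(z, f_p)\left( dq(z) - dp(z) \right) = \langle \nabla\Phi(p), (q - p) \rangle$ is the derivative in the direction $q - p$. By definition, the differential is a function $\nabla\Phi$ such that

$$\lim_{h \to 0} \frac{\Phi(p + h) - \Phi(p) - \nabla\Phi(p) \cdot h}{\|h\|} = 0.$$

Hence, it remains to check that for any distribution $r$

$$\lim_{\alpha \to 0} \frac{\Phi\left( (1 - \alpha)p + \alpha r \right) - \Phi(p) - \int_{\mathcal{Z}} \ell(z, f_p)\left( \alpha (dr(z) - dp(z)) \right)}{\alpha} = 0. \tag{18}$$

Rewriting,

$$\begin{aligned}
&\Phi\left( (1 - \alpha)p + \alpha r \right) - \Phi(p) - \int_{\mathcal{Z}} \ell(z, f_p)\left( \alpha (dr(z) - dp(z)) \right) \\
&\quad = \min_f \mathbb{E}_{(1-\alpha)p + \alpha r}\, \ell(z, f) - (1 - \alpha) \min_f \mathbb{E}_p \ell(z, f) - \alpha\, \mathbb{E}_r \ell(z, f_p) \\
&\quad = \min_f \left[ (1 - \alpha)\, \mathbb{E}_p \ell(z, f) + \alpha\, \mathbb{E}_r \ell(z, f) \right] - (1 - \alpha)\, \mathbb{E}_p \ell(z, f_p) - \alpha\, \mathbb{E}_r \ell(z, f_p).
\end{aligned}$$

It is evident that the above expression is non-positive by substituting a particular choice of $f_p$ in the first minimum. For the lower bound, use the bound of Eq. (11):

$$\begin{aligned}
&\mathbb{E}_{(1-\alpha)p + \alpha r}\, \ell\left( z, f_{(1-\alpha)p + \alpha r} \right) - (1 - \alpha)\, \mathbb{E}_p \ell(z, f_p) - \alpha\, \mathbb{E}_r \ell(z, f_p) \\
&\quad = \mathbb{E}_{(1-\alpha)p + \alpha r}\, \ell\left( z, f_{(1-\alpha)p + \alpha r} \right) - \mathbb{E}_{(1-\alpha)p + \alpha r}\, \ell(z, f_p) \\
&\quad \ge -\frac{2L^2}{\sigma} \left\| \alpha (p - r) \right\|_1^2 = -O(\alpha^2).
\end{aligned}$$

Thus, Eq. (18) is verified.

□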